Hi Ron,

Yes, naming is hard. But it will have a large impact on trainings, presentations, and the mental model of users. Maybe the easiest approach is to collect a ranking from everyone, with a short justification:


My ranking (from good to not so good):

1. Refresh Table -> states what it does
2. Materialized Table -> similar to SQL materialized view but a table
3. Live Table -> nice buzzword, but maybe still too close to dynamic tables?
4. Materialized View -> a bit broader than the standard but still very similar
5. Derived Table -> taken by the standard

Regards,
Timo



On 07.04.24 11:34, Ron liu wrote:
Hi, Dev

This is a summary email. After several rounds of discussion, there is strong consensus about the FLIP proposal and the issues it aims to address. The current point of disagreement is the naming of the new concept. I have summarized the candidates as follows:

1. Derived Table (Inspired by Google Looker [1])
     - Pros: Google Looker has introduced this concept, which is designed for building Looker's automated modeling, aligning with our purpose for the stream-batch automatic pipeline.

     - Cons: The SQL standard uses the term "derived table" extensively; vendors adopt it to simply refer to a table within a subclause.

2. Materialized Table: It means materializing the query result into a table, similar to Db2 MQT (Materialized Query Tables) [2]. In addition, Snowflake Dynamic Table's predecessor was also called Materialized Table [3]. (See the sketch after this list.)

3. Updating Table (From Timo)

4. Updating Materialized View (From Timo)

5. Refresh/Live Table (From Martijn)
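
To make the candidates concrete, here is how candidate 2 could read as DDL (hypothetical syntax; the FRESHNESS clause and all names are only for illustration):

    CREATE MATERIALIZED TABLE dwd_orders
      FRESHNESS = INTERVAL '3' MINUTE
    AS SELECT o.order_id, o.amount, u.region
    FROM ods_orders AS o JOIN dim_users AS u ON o.user_id = u.user_id;

The other candidates would only swap the leading keyword (REFRESH TABLE, LIVE TABLE, ...).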

As Martijn said, naming is a headache; looking forward to more valuable input from everyone.

[1] https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
[2] https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
[3] https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables

Best,
Ron

Ron liu <ron9....@gmail.com> wrote on Sunday, April 7, 2024 at 15:55:

Hi, Lorenzo

Thank you for your insightful input.

I think the two above twisted the materialized view concept into more than just an optimization for accessing pre-computed aggregates/filters. I think that concept (at least in my mind) now adheres more to the semantics of the words themselves ("materialized" and "view") than to its implementations in DBMSs: just a view on raw data that, hopefully, is constantly updated with fresh results. That's why I understand Timo's et al. objections.

Your understanding of Materialized Views is correct. However, in our scenario, an important feature is support for Update & Delete operations, which current Materialized Views cannot fulfill. As we discussed with Timo before, if a Materialized View were to support data modifications, it would require extending the syntax with new keywords, such as CREATE xxx (UPDATING) MATERIALIZED VIEW.

Still, I don't understand why we need another type of special table. Could you dive deep into the reasons for not simply adding the FRESHNESS parameter to standard tables?

Firstly, I need to emphasize that we cannot achieve the design goal of the FLIP through the CREATE TABLE syntax combined with a FRESHNESS parameter. The proposal of this FLIP is to use Dynamic Table + Continuous Query, combined with FRESHNESS, to realize stream-batch unification. However, CREATE TABLE is merely a metadata operation and cannot automatically start a background refresh job. To achieve the design goal of the FLIP with standard tables, we would need to extend the CTAS [1] syntax to introduce the FRESHNESS keyword. We considered this design initially, but it has the following problems:

1. Using the FRESHNESS keyword to distinguish whether a table created through CTAS is a standard table or a "special" standard table with an ongoing background refresh job is very obscure for users.
2. It intrudes on the semantics of the CTAS syntax. Currently, tables created using CTAS only add table metadata to the Catalog and do not record attributes such as the query. There are also no ongoing background refresh jobs, and the data writing operation happens only once, at table creation.
3. At the framework level, when performing certain ALTER TABLE operations, it is unclear how the behavior of a table created with FRESHNESS should be distinguished from that of a table created without it, which also causes confusion.

In terms of the design goal of combining Dynamic Table + Continuous Query, the FLIP proposal cannot be realized by merely extending current standard tables, so a new kind of dynamic table needs to be introduced as a first-level concept.
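
For illustration, a rough sketch of the two shapes (hypothetical syntax; the FRESHNESS clause and table names are only examples):

    -- rejected: CTAS plus a FRESHNESS keyword; whether a background refresh
    -- job exists is hidden behind one optional keyword
    CREATE TABLE dwd_orders FRESHNESS = INTERVAL '3' MINUTE
    AS SELECT * FROM ods_orders;

    -- proposed: a first-level concept whose DDL always implies a query
    -- definition plus an automatically managed refresh job
    CREATE DYNAMIC TABLE dwd_orders FRESHNESS = INTERVAL '3' MINUTE
    AS SELECT * FROM ods_orders;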

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement

Best,
Ron

<lorenzo.affe...@ververica.com.invalid> wrote on Wednesday, April 3, 2024 at 22:25:

Hello everybody!
Thanks for the FLIP, as it looks amazing (and I think the proof is the deep discussion it is provoking :))

I have a couple of comments to add to this:

Even though I get the reason why you rejected MATERIALIZED VIEW, I still like it a lot, and I would like to provide pointers on how the materialized view concept has twisted in recent years:

• Materialize DB (https://materialize.com/)
• The famous talk by Martin Kleppmann, "Turning the database inside out" (https://www.youtube.com/watch?v=fU9hR3kiOK0)

I think the two above twisted the materialized view concept into more than just an optimization for accessing pre-computed aggregates/filters. I think that concept (at least in my mind) now adheres more to the semantics of the words themselves ("materialized" and "view") than to its implementations in DBMSs: just a view on raw data that, hopefully, is constantly updated with fresh results. That's why I understand Timo's et al. objections.
Still I understand there is no need to add confusion :)

Still, I don't understand why we need another type of special table. Could you dive deep into the reasons for not simply adding the FRESHNESS parameter to standard tables?

I would see that as a very seamless implementation with the goal of unifying batch and streaming. If we stick to a unified world, I think that Flink should provide just one type of table that is inherently dynamic. Now, depending on the FRESHNESS objectives / connectors used in WITH, that table can be backed by a stream or batch job, as you explained in your FLIP. A rough sketch of what I mean follows below.
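
Something like this (hypothetical syntax, just to illustrate a single inherently dynamic table type):

    CREATE TABLE enriched_orders
      WITH ('connector' = 'paimon')        -- assumption: some lake storage connector
      FRESHNESS = INTERVAL '1' MINUTE      -- engine picks streaming vs. batch backing
    AS SELECT ...;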

Maybe I am totally missing the point :)

Thank you in advance,
Lorenzo
On Apr 3, 2024 at 15:25 +0200, Martijn Visser <martijnvis...@apache.org> wrote:
Hi all,

Thanks for the proposal. While the FLIP talks extensively about how Snowflake has Dynamic Tables and Databricks has Delta Live Tables, my understanding is that Databricks has CREATE STREAMING TABLE [1], which relates to this proposal.

I do have concerns about using CREATE DYNAMIC TABLE, specifically about confusing users who are familiar with Snowflake's approach, where you can't change the content via DML statements, while that is something that would work in this proposal. Naming is hard of course, but I would probably prefer something like CREATE CONTINUOUS TABLE, CREATE REFRESH TABLE or CREATE LIVE TABLE.

Best regards,

Martijn

[1] https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table.html

On Wed, Apr 3, 2024 at 5:19 AM Ron liu <ron9....@gmail.com> wrote:

Hi, dev

After offline discussion with Becket Qin, Lincoln Lee and Jark Wu, we have improved some parts of the FLIP:

1. Added a Full Refresh Mode section to clarify the semantics of full refresh mode.
2. Added a Future Improvement section explaining why the query statement does not support references to temporary views, and possible solutions.
3. The Future Improvement section explains a possible future solution for dynamic tables to support modification of query statements, to meet the common field-level schema evolution requirements of the lakehouse.
4. The Refresh section emphasizes that the Refresh command and the background refresh job can be executed in parallel, with no restrictions at the framework level.
5. Converted RefreshHandler into a plug-in interface to support various workflow schedulers.

Best,
Ron

Ron liu <ron9....@gmail.com> wrote on Tuesday, April 2, 2024 at 10:28:

Hi, Venkata krishnan

Thank you for your involvement and suggestions; we hope the design goals of this FLIP will be helpful for your business.

1. In the proposed FLIP, given the example for the dynamic table, do the data sources always come from a single lake storage such as Paimon, or does the same proposal solve for 2 disparate storage systems like Kafka and Iceberg, where Kafka events are ETLed to Iceberg similar to Paimon? Basically the lambda architecture that is mentioned in the FLIP as well. I'm wondering if it is possible to switch b/w sources based on the execution mode, e.g.: if it is a backfill operation, switch to a data lake storage system like Iceberg, otherwise an event streaming system like Kafka.

Dynamic table is a design abstraction at the framework level and is not tied to the physical implementation of the connector. If a connector supports a combination of Kafka and lake storage, this works fine.

2. What happens in the context of a bootstrap (batch) + nearline update (streaming) case for stateful applications? What I mean by that is, will the state from the batch application be transferred to the nearline application after the bootstrap execution is complete?

I think this is an orthogonal concern, something that FLIP-327 [1] tries to address; it is not directly related to Dynamic Table.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data

Best,
Ron

Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote on Saturday, March 30, 2024 at 07:06:

Ron and Lincoln,

Great proposal and interesting discussion on adding support for dynamic tables within Flink.

At LinkedIn, we are also trying to solve compute/storage convergence for similar problems discussed as part of this FLIP, specifically periodic backfill and bootstrap + nearline update use cases, using a single implementation of the business logic (single script).

Few clarifying questions:

1. In the proposed FLIP, given the example for the dynamic table, do the data sources always come from a single lake storage such as Paimon, or does the same proposal solve for 2 disparate storage systems like Kafka and Iceberg, where Kafka events are ETLed to Iceberg similar to Paimon? Basically the lambda architecture that is mentioned in the FLIP as well. I'm wondering if it is possible to switch b/w sources based on the execution mode, e.g.: if it is a backfill operation, switch to a data lake storage system like Iceberg, otherwise an event streaming system like Kafka.
2. What happens in the context of a bootstrap (batch) + nearline update (streaming) case for stateful applications? What I mean by that is, will the state from the batch application be transferred to the nearline application after the bootstrap execution is complete?

Regards
Venkata krishnan


On Mon, Mar 25, 2024 at 8:03 PM Ron liu <ron9....@gmail.com> wrote:

Hi, Timo

Thanks for your quick response and your suggestions.

Yes, this discussion has turned into confirming whether it's a special table or a special MV.

1. The key problem with MVs is that they don't support modification, so I prefer it to be a special table. Although the periodic refresh behavior is more characteristic of an MV, since we are already a special table, supporting periodic refresh behavior is quite natural, similar to Snowflake dynamic tables.

2. Regarding the keyword UPDATING: since the current Regular Table is a Dynamic Table, which implies support for updating through Continuous Query, I think it is redundant to add the keyword UPDATING. In addition, UPDATING cannot reflect the Continuous Query part and cannot express the purpose we want: simplifying the data pipeline through Dynamic Table + Continuous Query.

3. From the perspective of the SQL standard definition, I can understand your concerns about Derived Table, but is it possible to make a slight adjustment to meet our needs? Additionally, as Lincoln mentioned, the Google Looker platform has introduced Persistent Derived Table, and there are precedents in the industry; could Derived Table be a candidate?

Of course, I look forward to your better suggestions.

Best,
Ron



Timo Walther <twal...@apache.org> wrote on Monday, March 25, 2024 at 18:49:

After thinking about this more, this discussion boils down to whether this is a special table or a special materialized view. In both cases, we would need to add a special keyword:

Either

CREATE UPDATING TABLE

or

CREATE UPDATING MATERIALIZED VIEW

I still feel that the periodic refreshing behavior is closer to an MV. If we add a special keyword to MV, the optimizer would know that the data cannot be used for query optimizations.

I will ask more people for their opinion.

Regards,
Timo


On 25.03.24 10:45, Timo Walther wrote:
Hi Ron and Lincoln,

thanks for the quick response and the very insightful discussion.

we might limit future opportunities to optimize queries through automatic materialization rewriting by allowing data modifications, thus losing the potential for such optimizations.

This argument makes a lot of sense to me. Due to the updates, the system is not in full control of the persisted data. However, the system is still in full control of the job that powers the refresh. So if the system manages all updating pipelines, it could still leverage automatic materialization rewriting, but without leveraging the data at rest (only the data in flight).

we are considering another candidate, Derived Table, the term 'derive' suggests a query, and 'table' retains modifiability. This approach would not disrupt our current concept of a dynamic table

I did some research on this term. The SQL standard uses the term "derived table" extensively (defined in section 4.17.3). Thus, a lot of vendors adopt this for simply referring to a table within a subclause:

https://dev.mysql.com/doc/refman/8.0/en/derived-tables.html
https://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc32300.1600/doc/html/san1390612291252.html
https://www.c-sharpcorner.com/article/derived-tables-vs-common-table-expressions/
https://stackoverflow.com/questions/26529804/what-are-the-derived-tables-in-my-explain-statement
https://www.sqlservercentral.com/articles/sql-derived-tables

Esp. the latter example is interesting: SQL Server allows things like this on derived tables:

UPDATE T SET Name='Timo' FROM (SELECT * FROM Product) AS T

SELECT * FROM Product;

Btw, Snowflake's dynamic table docs also state:

Because the content of a dynamic table is fully determined by the given query, the content cannot be changed by using DML. You don't insert, update, or delete the rows in a dynamic table.

So a new term makes a lot of sense.

How about using `UPDATING`?

CREATE UPDATING TABLE

This reflects that modifications can be made, and from an English-language perspective you can PAUSE or RESUME the UPDATING. Thus, a user can define an UPDATING interval and mode?
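
A minimal sketch of how that could look (hypothetical syntax; the interval clause and PAUSE/RESUME commands are assumptions):

    CREATE UPDATING TABLE enriched_orders
      FRESHNESS = INTERVAL '5' MINUTE  -- hypothetical: the UPDATING interval
    AS SELECT ...;

    ALTER TABLE enriched_orders PAUSE UPDATING;   -- hypothetical
    ALTER TABLE enriched_orders RESUME UPDATING;  -- hypothetical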

Looking forward to your thoughts.

Regards,
Timo


On 25.03.24 07:09, Ron liu wrote:
Hi, Ahmed

Thanks for your feedback.

Regarding your question:

I want to iterate on Timo's comments regarding the confusion between "Dynamic Table" and the current Flink "Table". Should the refactoring of the system happen in 2.0? Should we rename it in this FLIP (as per the suggestions in the thread) and address the holistic changes in a separate FLIP for 2.0?

Lincoln proposed a new concept in reply to Timo: Derived Table, which is a combination of Dynamic Table + Continuous Query, and the use of Derived Table will not conflict with existing concepts. What do you think?

I feel confused about how it integrates further with other components; the examples provided feel like a standalone ETL job. Could you provide in the FLIP an example where the table is further used in subsequent queries (especially in batch mode)?

Thanks for your suggestion. I have added how to use Dynamic Table in the FLIP user story section; a Dynamic Table can be referenced by downstream Dynamic Tables and can also support OLAP queries.

Best,
Ron

Ron liu <ron9....@gmail.com> wrote on Saturday, March 23, 2024 at 10:35:

Hi, Feng

Thanks for your feedback.

Although currently we restrict users from modifying the query, I wonder if we can provide a better way to help users rebuild it without affecting downstream OLAP queries.

Considering the problem of data consistency, in the first step we are strictly limited in semantics and do not support modifying the query. This is really a good problem; one of my ideas is to introduce a syntax similar to SWAP [1], which supports exchanging two Dynamic Tables, as sketched below.
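
A rough sketch of the idea (hypothetical syntax, modeled after Snowflake's ALTER TABLE ... SWAP WITH):

    -- build the new definition side by side, then atomically exchange the two
    ALTER DYNAMIC TABLE dwd_orders_v2 SWAP WITH dwd_orders;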

From the documentation, the definition SQL and job information are stored in the Catalog. Does this mean that if a system needs to adapt to Dynamic Tables, it also needs to store Flink's job information in the corresponding system? For example, does MySQL's Catalog need to store Flink job information as well?

Yes, currently we need to rely on the Catalog to store refresh job information.

Users still need to consider how much memory is being used, how large the concurrency is, which type of state backend is being used, and may need to set TTL expiration.

Similar to the current practice, job parameters can be set via the Flink conf or SET commands.
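
For example, with the SQL client's SET command (these are existing Flink options; which keys a user tunes will vary by job and Flink version):

    SET 'parallelism.default' = '4';
    SET 'table.exec.state.ttl' = '1 d';
    SET 'state.backend' = 'rocksdb';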

When we submit a refresh command, can we help users detect if there are any running jobs and automatically stop them before executing the refresh command? Then wait for it to complete before restarting the background streaming job?

Purely from a technical implementation point of view, your proposal is doable, but it would be more costly. Also, I think data consistency itself is the responsibility of the user, similar to how a Regular Table is now also the responsibility of the user, so it's consistent with its behavior and no additional guarantees are made at the engine level.

Best,
Ron


Ahmed Hamdy <hamdy10...@gmail.com> wrote on Friday, March 22, 2024 at 23:50:

Hi Ron,
Sorry for joining the discussion late; thanks for the effort.

I think the base idea is great; however, I have a couple of comments:
- I want to iterate on Timo's comments regarding the confusion between "Dynamic Table" and the current Flink "Table". Should the refactoring of the system happen in 2.0? Should we rename it in this FLIP (as per the suggestions in the thread) and address the holistic changes in a separate FLIP for 2.0?
- I feel confused about how it integrates further with other components; the examples provided feel like a standalone ETL job. Could you provide in the FLIP an example where the table is further used in subsequent queries (especially in batch mode)?
- I really like the principle of keeping the unified batch and streaming approach.
Best Regards
Ahmed Hamdy


On Fri, 22 Mar 2024 at 12:07, Lincoln Lee <lincoln.8...@gmail.com> wrote:

Hi Timo,

Thanks for your thoughtful inputs!

Yes, expanding the MATERIALIZED VIEW (MV) could achieve the same function, but our primary concern is that by using a view, we might limit future opportunities to optimize queries through automatic materialization rewriting [1], leveraging the support for MV by physical storage. This is because we would be breaking the intuitive semantics of a materialized view (a materialized view represents the result of a query) by allowing data modifications, thus losing the potential for such optimizations.

With these considerations in mind, we were inspired by Google Looker's Persistent Derived Table [2]. PDT is designed for building Looker's automated modeling, aligning with our purpose for the stream-batch automatic pipeline. Therefore, we are considering another candidate, Derived Table: the term 'derive' suggests a query, and 'table' retains modifiability. This approach would not disrupt our current concept of a dynamic table, preserving the future utility of MVs.

Conceptually, a Derived Table is a Dynamic Table + Continuous Query. By introducing the new concept Derived Table for this FLIP, all concepts play together nicely.

What do you think about this?

[1] https://calcite.apache.org/docs/materialized_views.html
[2] https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables


Best,
Lincoln Lee


Timo Walther <twal...@apache.org> wrote on Friday, March 22, 2024 at 17:54:

Hi Ron,

thanks for the detailed answer. Sorry for my late reply, we had a conference that kept me busy.

In the current concept[1], it actually includes: Dynamic Tables & Continuous Query. Dynamic Table is just an abstract logical concept

This explanation makes sense to me. But the docs also say "A continuous query is evaluated on the dynamic table yielding a new dynamic table.". So even our regular CREATE TABLEs are considered dynamic tables. This can also be seen in the diagram "Dynamic Table -> Continuous Query -> Dynamic Table". Currently, Flink queries can only be executed on Dynamic Tables.

In essence, a materialized view represents the result of a query.

Isn't that what your proposal does as well?

the object of the suspend operation is the refresh task of the dynamic table

I understand that Snowflake uses the term [1] to merge their concepts of STREAM, TASK, and TABLE into one concept. But Flink has no concept of a "refresh task". Also, they already introduced MATERIALIZED VIEW. Flink is in the convenient position that the concept of materialized views is not taken (reserved maybe for exactly this use case?). And the SQL standard concept could be "slightly adapted" to our needs. Looking at other vendors like Postgres[2], they also use `REFRESH` commands, so why not add additional commands such as DELETE or UPDATE? Oracle supports an "ON PREBUILT TABLE clause [that] tells the database to use an existing table segment"[3], which comes closer to what we want as well.
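
For reference, the Postgres command looks like this:

    REFRESH MATERIALIZED VIEW enriched_orders;

and the extension I have in mind would simply allow DML on top (hypothetical, not valid in Postgres or Flink today):

    DELETE FROM enriched_orders WHERE user_id = 42;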

it is not intended to support data modification

This is an argument that I understand. But we as Flink could allow data modifications. This way we are only extending the standard and don't introduce new concepts.

If we can't agree on using the MATERIALIZED VIEW concept, we should fix our syntax in a Flink 2.0 effort: making regular tables bounded and dynamic tables unbounded. We would be closer to the SQL standard with this and pave the way for the future. I would actually support this if all concepts play together nicely.

In the future, we can consider extending the statement set syntax to support the creation of multiple dynamic tables.

It's good that we called the concept STATEMENT SET. This allows us to define CREATE TABLE within, even if it might look a bit confusing.

Regards,
Timo

[1] https://docs.snowflake.com/en/user-guide/dynamic-tables-about
[2] https://www.postgresql.org/docs/current/sql-creatematerializedview.html
[3] https://oracle-base.com/articles/misc/materialized-views

On 21.03.24 04:14, Feng Jin wrote:
Hi Ron and Lincoln

Thanks for driving this discussion. I believe it will greatly improve the convenience of managing users' real-time pipelines.

I have some questions.

*Regarding Limitations of Dynamic Table:*

Does not support modifying the select statement after the dynamic table is created.

Although currently we restrict users from modifying the query, I wonder if we can provide a better way to help users rebuild it without affecting downstream OLAP queries.

*Regarding the management of background jobs:*

1. From the documentation, the definition SQL and job information are stored in the Catalog. Does this mean that if a system needs to adapt to Dynamic Tables, it also needs to store Flink's job information in the corresponding system? For example, does MySQL's Catalog need to store Flink job information as well?

2. Users still need to consider how much memory is being used, how large the concurrency is, which type of state backend is being used, and may need to set TTL expiration.

*Regarding the Refresh Part:*

If the refresh mode is continuous and a background job is running, caution should be taken with the refresh command as it can lead to inconsistent data.

When we submit a refresh command, can we help users detect if there are any running jobs and automatically stop them before executing the refresh command? Then wait for it to complete before restarting the background streaming job?

Best,
Feng

On Tue, Mar 19, 2024 at 9:40 PM Lincoln Lee <lincoln.8...@gmail.com> wrote:

Hi Yun,

Thank you very much for your valuable input!

Incremental mode is indeed an attractive idea; we have also discussed this, but in the current design we first provide two refresh modes: CONTINUOUS and FULL. Incremental mode can be introduced once the execution layer has the capability.
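
For illustration, how freshness could steer the refresh mode (a sketch based on this FLIP's direction; the exact thresholds and syntax are assumptions):

    -- short freshness => CONTINUOUS mode: a streaming refresh job
    CREATE DYNAMIC TABLE dwd_orders FRESHNESS = INTERVAL '1' MINUTE AS SELECT ...;

    -- long freshness => FULL mode: a periodically scheduled batch refresh
    CREATE DYNAMIC TABLE dws_report FRESHNESS = INTERVAL '1' HOUR AS SELECT ...;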

My answers for the two questions:

1. Yes, cascading is a good question. The current proposal provides a freshness that defines a dynamic table relative to the base table's lag. If users need to consider the end-to-end freshness of multiple cascaded dynamic tables, they can manually split them for now. Of course, how to let multiple cascaded or dependent dynamic tables complete the freshness definition in a simpler way can, I think, be extended in the future.

2. Cascading refresh is also a part we focused on discussing. In this FLIP, we hope to focus as much as possible on the core features (as it already involves a lot of things), so we did not directly introduce related syntax. However, based on the current design, combined with the catalog and lineage, users could theoretically also finish the cascading refresh.


Best,
Lincoln Lee


Yun Tang <myas...@live.com> wrote on Tuesday, March 19, 2024 at 13:45:

Hi Lincoln,

Thanks for driving this discussion; I am so excited to see this topic being discussed in the Flink community!

From my point of view, beyond the work of unifying streaming and batch in the DataStream API [1], this FLIP could actually make users benefit from one engine to rule batch & streaming.

If we treat this FLIP as an open-source implementation of Snowflake's dynamic tables [2], we still lack an incremental refresh mode to make the ETL near real-time with a much cheaper computation cost. However, I think this could be done under the current design by introducing another refresh mode in the future, although the extra work of incremental view maintenance would be much larger.

For the FLIP itself, I have several questions below:

1. It seems this FLIP does not consider the lag of refreshes across ETL layers from ODS ---> DWD ---> APP [3]. We currently only consider the scheduler interval, which means we cannot use lag to automatically schedule the upfront micro-batch jobs to do the work.
2. To support the automagical refreshes, we should consider the lineage in the catalog or somewhere else.


[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-134%3A+Batch+execution+for+the+DataStream+API
[2] https://docs.snowflake.com/en/user-guide/dynamic-tables-about
[3] https://docs.snowflake.com/en/user-guide/dynamic-tables-refresh

Best
Yun Tang



________________________________
From: Lincoln Lee <lincoln.8...@gmail.com>
Sent: Thursday, March 14, 2024 14:35
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

Hi Jing,

Thanks for your attention to this FLIP! I'll try to answer the following questions.

1. How to define the query of a dynamic table? Use Flink SQL or introduce new syntax? If Flink SQL, how to handle the difference in SQL between streaming and batch processing? For example, a query including a window aggregate based on processing time, or a query including a global order by?

Similar to `CREATE TABLE AS query`, here the `query` also uses Flink SQL and doesn't introduce a totally new syntax. We will not change the status with respect to the difference in functionality of Flink SQL itself on streaming and batch; for example, the proctime window agg on streaming and the global sort on batch that you mentioned do not, in fact, work properly in the other mode, so when the user modifies the refresh mode of a dynamic table to one that is not supported, we will throw an exception.

2. Is modifying the query of a dynamic table allowed? Or could we only refresh a dynamic table based on the initial query?

Yes, in the current design, the query definition of the dynamic table is not allowed to be modified, and you can only refresh the data based on the initial definition.

3. How to use a dynamic table? The dynamic table seems to be similar to a materialized view. Will we do something like materialized view rewriting during optimization?

It's true that dynamic table and materialized view are similar in some ways, but as Ron explains there are differences. In terms of optimization, automated materialization discovery similar to that supported by Calcite is also a potential possibility, perhaps with the addition of automated rewriting in the future.



Best,
Lincoln Lee


Ron liu <ron9....@gmail.com> wrote on Thursday, March 14, 2024 at 14:01:

Hi, Timo

Sorry for the late response; thanks for your feedback. Regarding your questions:

Flink has introduced the concept of Dynamic Tables many years ago. How does the term "Dynamic Table" fit into Flink's regular tables and also how does it relate to Table API?

I fear that adding the DYNAMIC TABLE keyword could cause confusion for users, because a term for regular CREATE TABLE (that can be "kind of dynamic" as well and is backed by a changelog) is then missing. Also given that we call our connectors for those tables DynamicTableSource and DynamicTableSink.

In general, I find it contradicting that a TABLE can be "paused" or "resumed". From an English language perspective, this does sound incorrect. In my opinion (without much research yet), a continuous updating trigger should rather be modelled as a CREATE MATERIALIZED VIEW (which users are familiar with?) or a new concept such as a CREATE TASK (that can be paused and resumed?).

1. In the current concept [1], it actually includes: Dynamic Tables & Continuous Query. Dynamic Table is just an abstract logical concept, which in its physical form represents either a table or a changelog stream. It requires the combination with Continuous Query to achieve dynamic updates of the target table, similar to a database's Materialized View. We hope to upgrade the Dynamic Table to a real entity that users can operate, which combines the logical concepts of Dynamic Tables + Continuous Query. By integrating the definition of tables and queries, it can achieve functions similar to Materialized Views, simplifying users' data processing pipelines. So, the object of the suspend operation is the refresh task of the dynamic table. The command `ALTER DYNAMIC TABLE table_name SUSPEND` is actually a shorthand for `ALTER DYNAMIC TABLE table_name SUSPEND REFRESH` (if written in full for clarity, we can also modify it).

2. Initially, we also considered Materialized Views, but ultimately decided against them. Materialized views are designed to enhance query performance for workloads that consist of common, repetitive query patterns. In essence, a materialized view represents the result of a query. However, it is not intended to support data modification. For Lakehouse scenarios, where the ability to delete or update data is crucial (such as compliance with GDPR, FLIP-2), materialized views fall short.

3. Compared to CREATE (regular) TABLE, CREATE DYNAMIC TABLE not only defines metadata in the catalog but also automatically initiates a data refresh task based on the query specified during table creation, and it dynamically executes data updates. Users can focus on data dependencies and data generation logic.

4. The new dynamic table does not conflict with the existing DynamicTableSource and DynamicTableSink interfaces. For the developer, all that needs to be implemented is the new CatalogDynamicTable, without changing the implementation of source and sink.

5. For now, the FLIP does not consider supporting Table API operations on Dynamic Table. However, once the SQL syntax is finalized, we can discuss this in a separate FLIP. Currently, I have a rough idea: the Table API should also introduce DynamicTable operation interfaces corresponding to the existing Table interfaces, and the TableEnvironment will provide relevant methods to support various dynamic table operations. The goal for the new Dynamic Table is to offer users an experience similar to using a database, which is why we prioritize SQL-based approaches initially.

How do you envision re-adding the functionality of a statement set that fans out to multiple tables? This is a very important use case for data pipelines.

Multi-tables is indeed a very important user scenario. In the future, we can consider extending the statement set syntax to support the creation of multiple dynamic tables.
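
Something like the following could work (purely hypothetical syntax, extending today's EXECUTE STATEMENT SET):

    EXECUTE STATEMENT SET
    BEGIN
      CREATE DYNAMIC TABLE dwd_orders FRESHNESS = INTERVAL '1' MINUTE AS SELECT ...;
      CREATE DYNAMIC TABLE dwd_users  FRESHNESS = INTERVAL '1' MINUTE AS SELECT ...;
    END;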


Since the early days of Flink SQL, we were discussing `SELECT STREAM * FROM T EMIT 5 MINUTES`. Your proposal seems to rephrase STREAM and EMIT into other keywords, DYNAMIC TABLE and FRESHNESS. But the core functionality is still there. I'm wondering if we should widen the scope (maybe not part of this FLIP but a new FLIP) to follow the standard more closely: making `SELECT * FROM t` bounded by default and using new syntax for the dynamic behavior. Flink 2.0 would be the perfect time for this; however, it would require careful discussions. What do you think?

The query part indeed requires a separate FLIP for discussion, as it involves changes to the default behavior.

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/concepts/dynamic_tables


Best,

Ron


Jing Zhang <beyond1...@gmail.com> wrote on Wednesday, March 13, 2024 at 15:19:

Hi, Lincoln & Ron,

Thanks for the proposal.

I agree with the question raised by Timo.

Besides, I have some other questions.
1. How to define the query of a dynamic table? Use Flink SQL or introduce new syntax? If Flink SQL, how to handle the difference in SQL between streaming and batch processing? For example, a query including a window aggregate based on processing time, or a query including a global order by?

2. Is modifying the query of a dynamic table allowed? Or could we only refresh a dynamic table based on the initial query?

3. How to use a dynamic table? The dynamic table seems to be similar to a materialized view. Will we do something like materialized view rewriting during optimization?

Best,
Jing Zhang


Timo Walther <twal...@apache.org> wrote on Wednesday, March 13, 2024 at 01:24:

Hi Lincoln & Ron,

thanks for proposing this FLIP. I think a design similar to what you propose has been in the heads of many people; however, I'm wondering how this will fit into the bigger picture.

I haven't deeply reviewed the FLIP yet, but would like to ask some initial questions:

Flink has introduced the concept of Dynamic Tables many years ago. How does the term "Dynamic Table" fit into Flink's regular tables and also how does it relate to Table API?

I fear that adding the DYNAMIC TABLE keyword could cause confusion for users, because a term for regular CREATE TABLE (that can be "kind of dynamic" as well and is backed by a changelog) is then missing. Also given that we call our connectors for those tables DynamicTableSource and DynamicTableSink.

In general, I find it contradicting that a TABLE can be "paused" or "resumed". From an English language perspective, this does sound incorrect. In my opinion (without much research yet), a continuous updating trigger should rather be modelled as a CREATE MATERIALIZED VIEW (which users are familiar with?) or a new concept such as a CREATE TASK (that can be paused and resumed?).

How do you envision re-adding the functionality of a statement set that fans out to multiple tables? This is a very important use case for data pipelines.

Since the early days of Flink SQL, we were discussing `SELECT STREAM * FROM T EMIT 5 MINUTES`. Your proposal seems to rephrase STREAM and EMIT into other keywords, DYNAMIC TABLE and FRESHNESS. But the core functionality is still there. I'm wondering if we should widen the scope (maybe not part of this FLIP but a new FLIP) to follow the standard more closely: making `SELECT * FROM t` bounded by default and using new syntax for the dynamic behavior. Flink 2.0 would be the perfect time for this; however, it would require careful discussions. What do you think?

Regards,
Timo


On 11.03.24 08:23, Ron liu wrote:
Hi, Dev


Lincoln Lee and I would like to start a discussion about FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines.

This FLIP is designed to simplify the development of data processing pipelines. With Dynamic Tables with uniform SQL statements and freshness, users can define batch and streaming transformations to data in the same way, accelerate ETL pipeline development, and manage task scheduling automatically.

For more details, see FLIP-435 [1]. Looking forward to your feedback.

[1]

Best,

Lincoln & Ron
























