Re: [DISCUSS]: Flink code generation porting to Calcite

2024-07-17 Thread Jark Wu
Hi James,

After reading the comments in CALCITE-3094, I think what you are looking
for is Flink's code-splitting tool.
Code splitting is a common need for Java code generation, and Flink has
extracted it into a separate module, "flink-table-code-splitter" [1],
with few dependencies. This is how Flink SQL uses it [2].
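
For context, a minimal sketch of using that module as a standalone library.
It assumes the JavaCodeSplitter entry point from "flink-table-code-splitter"
is on the classpath; the thresholds below are illustrative, not Flink's
defaults.

```java
import org.apache.flink.table.codesplit.JavaCodeSplitter;

public class CodeSplitExample {
    public static void main(String[] args) {
        // Generated code whose methods may exceed the JVM's 64KB
        // bytecode-per-method limit after code generation.
        String generatedCode =
            "public class GeneratedExpression {\n"
                + "  public int eval(int a, int b) {\n"
                + "    return a + b;\n" // imagine thousands of generated lines
                + "  }\n"
                + "}\n";

        // Rewrites oversized methods into smaller ones so the result
        // stays compilable by Janino.
        String splitCode = JavaCodeSplitter.split(generatedCode, 4000, 10000);
        System.out.println(splitCode);
    }
}
```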

Best,
Jark

[1]:
https://github.com/apache/flink/blob/master/flink-table/flink-table-code-splitter/pom.xml
[2]:
https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/generated/GeneratedClass.java#L58

On Thu, 18 Jul 2024 at 07:12, James Duong 
wrote:

> Hi Flink developer community,
>
> I’m contributing to the Apache Calcite project and there’s interest in
> making use of Flink’s code generation. See the comments on
> https://issues.apache.org/jira/browse/CALCITE-3094
>
> Would someone be able to point me to where Flink integrates Janino? I
> originally thought it would be related to this class:
> https://nightlies.apache.org/flink/flink-docs-release-1.3/api/java/org/apache/flink/table/codegen/CodeGenerator.html
>
> But I haven’t been able to find this on main.
>
> Thanks!
>


Re: [2.0] How to handle on-going feature development in Flink 2.0?

2024-06-25 Thread Jark Wu
I also think this should not block new feature development.
Having "nice-to-have" and "must-have" tags on the FLIPs is a good idea.

For the downstream projects, I think we need to release a 2.0 preview
version one or two months before the formal release. This leaves some time
for the downstream projects to integrate and provide feedback, so that we
can fix problems (e.g., unexpected breaking changes, Java version issues)
before 2.0.

Best,
Jark

On Wed, 26 Jun 2024 at 09:39, Xintong Song  wrote:

> I also don't think we should block new feature development until 2.0. From
> my understanding, the new major release is no different from the regular
> minor releases for new features.
>
> I think tracking new features, either as nice-to-have items or in a
> separate list, is necessary. It helps us understand what's going on in the
> release cycle, and what to announce and promote. Maybe we should start a
> discussion on updating the 2.0 item list, to 1) collect new items that were
> proposed / initiated after the original list was created, and 2) remove
> some items that are no longer suitable. I'll discuss this with the other
> release managers first.
>
> For the connectors and operators, I think it depends on whether they depend
> on any deprecated APIs or internal implementations of Flink. Ideally,
> all @Public APIs and @PublicEvolving APIs that we plan to change / remove
> should have been deprecated in 1.19 and 1.20 respectively. That means if
> the connectors and operators only use non-deprecated @Public
> and @PublicEvolving APIs in 1.20, hopefully there should not be any
> problems upgrading to 2.0.
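>
> For illustration, a minimal sketch of such a dependency: a hypothetical
> connector that still implements SourceFunction, a @Public API that was
> deprecated before 2.0 in favor of the FLIP-27 Source API, and would
> therefore need migration before the upgrade.
>
> ```java
> // Hypothetical connector pinned to a deprecated @Public API. It compiles
> // against 1.20 with deprecation warnings but breaks once the API is
> // removed in 2.0.
> import org.apache.flink.streaming.api.functions.source.SourceFunction;
>
> public class LegacySource implements SourceFunction<String> {
>     private volatile boolean running = true;
>
>     @Override
>     public void run(SourceContext<String> ctx) throws Exception {
>         while (running) {
>             ctx.collect("event"); // emit until cancel() flips the flag
>         }
>     }
>
>     @Override
>     public void cancel() {
>         running = false;
>     }
> }
> ```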
>
> Best,
>
> Xintong
>
>
>
> On Wed, Jun 26, 2024 at 5:20 AM Becket Qin  wrote:
>
> > Thanks for the question, Matthias.
> >
> > My two cents, I don't think we are blocking new feature development. My
> > understanding is that the community will just prioritize removing
> > deprecated APIs in the 2.0 dev cycle. Because of that, it is possible
> that
> > some new feature development may slow down a little bit since some
> > contributors may be working on the must-have features for 2.0. But
> > policy-wise, I don't see a reason to block new feature development for
> > the sake of the 2.0 release feature plan [1].
> >
> > Process wise, I like your idea of adding the new features as nice-to-have
> > in the 2.0 feature list.
> >
> > Re: David,
> > Given it is a major version bump, it is possible that some of the
> > downstream projects (e.g. connectors, Paimon, etc.) will have to see if a
> > major version bump is also needed there. And those are probably going to
> > be decisions made on a per-project basis.
> > Regarding the Java version specifically, this is probably worth a
> > separate discussion. According to a recent report [2] on the state of
> > Java, it might be a little early to drop support for Java 11. We can
> > discuss this separately.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release
> > [2]
> >
> >
> https://newrelic.com/sites/default/files/2024-04/new-relic-state-of-the-java-ecosystem-report-2024-04-30.pdf
> >
> > On Tue, Jun 25, 2024 at 4:58 AM David Radley 
> > wrote:
> >
> > > Hi,
> > > I think this is a great question. I am not sure if this has been
> > > covered elsewhere, but it would be good to be clear how this affects
> > > the connectors and operator repos, with potentially v1- and
> > > v2-oriented new features. I suspect this will be a
> > > connector-by-connector investigation. I am thinking of connectors
> > > with Hadoop ecosystem dependencies (e.g. Paimon) which may not work
> > > nicely with Java 17.
> > >
> > >  Kind regards, David.
> > >
> > >
> > > From: Matthias Pohl 
> > > Date: Tuesday, 25 June 2024 at 09:57
> > > To: dev@flink.apache.org 
> > > Cc: Xintong Song , martijnvis...@apache.org <
> > > martijnvis...@apache.org>, imj...@gmail.com ,
> > > becket@gmail.com 
> > > Subject: [EXTERNAL] [2.0] How to handle on-going feature development in
> > > Flink 2.0?
> > > Hi 2.0 release managers,
> > > With the 1.20 release branch being cut [1], master is now referring to
> > > 2.0-SNAPSHOT. I remember that, initially, the community had the idea of
> > > keeping the 2.0 release as small as possible focusing on API changes
> [2].
> > >
> > > What does this mean for new features? I guess blocking them until 2.0
> is
> > > released is not a good option. Shall we treat new features as
> > > "nice-to-have" items as documented in the 2.0 release overview [3] and
> > > merge them to master like it was done for minor releases in the past?
> Do
> > > you want to add a separate section in the 2.0 release overview [3] to
> > list
> > > these new features (e.g. FLIP-461 [4]) separately? That might help to
> > > manage planned 2.0 deprecations/API removal and new features
> separately.
> > Or
> > > do you have a different process in mind?
> > >
> > > Apologies if this was already discussed somewhere. I didn't manage

Re: [VOTE] Apache Flink CDC Release 3.1.1, release candidate #0

2024-06-18 Thread Jark Wu
+1 (binding)

- Build and compile the source code locally: *OK*
- Verified signatures: *OK*
- Verified hashes: *OK*
- Checked no missing artifacts in the staging area: *OK*
- Reviewed the website release PR: *OK*
- Checked the licenses: *OK*
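
For reference, a minimal sketch of the hash check above, assuming the
source tarball and its .sha512 file sit in the working directory (file
names are placeholders; HexFormat requires Java 17+):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.HexFormat;

public class VerifyChecksum {
    public static void main(String[] args) throws Exception {
        // Compute the SHA-512 digest of the downloaded artifact.
        byte[] artifact = Files.readAllBytes(Paths.get("flink-cdc-3.1.1-src.tgz"));
        String actual = HexFormat.of()
            .formatHex(MessageDigest.getInstance("SHA-512").digest(artifact));

        // The published .sha512 file contains "<hash>  <filename>".
        String expected = Files
            .readString(Paths.get("flink-cdc-3.1.1-src.tgz.sha512"))
            .split("\\s+")[0];

        System.out.println(actual.equalsIgnoreCase(expected) ? "OK" : "MISMATCH");
    }
}
```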

Best,
Jark

On Tue, 18 Jun 2024 at 18:14, Leonard Xu  wrote:

> +1 (binding)
>
> - verified signatures
> - verified checksums
> - checked release notes
> - reviewed the web PR
> - tested Flink CDC works with Flink 1.19
> - tested route and transform in a MySQL-to-Doris pipeline
>
> Best,
> Leonard
>
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Weijie Guo

2024-06-04 Thread Jark Wu
Congratulations, Weijie!

Best,
Jark

On Tue, 4 Jun 2024 at 19:10, spoon_lz  wrote:

> Congratulations, Weijie!
>
>
>
> Regards,
> Zhuo.
>
>
>
>
>
>  Replied Message 
> | From | Aleksandr Pilipenko |
> | Date | 06/4/2024 18:59 |
> | To |  |
> | Subject | Re: [ANNOUNCE] New Apache Flink PMC Member - Weijie Guo |
> Congratulations, Weijie!
>
> Best,
> Aleksandr
>
> On Tue, 4 Jun 2024 at 11:42, Abdulquddus Babatunde Ekemode <
> abdulqud...@aligence.io> wrote:
>
> Congratulations! I wish you all the best.
>
> Best Regards,
> Abdulquddus
>
> On Tue, 4 Jun 2024 at 13:14, Ahmed Hamdy  wrote:
>
> Congratulations Weijie
> Best Regards
> Ahmed Hamdy
>
>
> On Tue, 4 Jun 2024 at 10:51, Matthias Pohl  wrote:
>
> Congratulations, Weijie!
>
> Matthias
>
> On Tue, Jun 4, 2024 at 11:12 AM Guowei Ma 
> wrote:
>
> Congratulations!
>
> Best,
> Guowei
>
>
> On Tue, Jun 4, 2024 at 4:55 PM gongzhongqiang <
> gongzhongqi...@apache.org
>
> wrote:
>
> Congratulations Weijie!
>
> Best,
> Zhongqiang Gong
>
> Xintong Song  wrote on Tue, Jun 4, 2024 at 14:46:
>
> Hi everyone,
>
> On behalf of the PMC, I'm very happy to announce that Weijie Guo
> has
> joined
> the Flink PMC!
>
> Weijie has been an active member of the Apache Flink community
> for
> many
> years. He has made significant contributions in many components,
> including
> runtime, shuffle, sdk, connectors, etc. He has driven /
> participated
> in
> many FLIPs, authored and reviewed hundreds of PRs, been
> consistently
> active
> on mailing lists, and also helped with release management of 1.20
> and
> several other bugfix releases.
>
> Congratulations and welcome Weijie!
>
> Best,
>
> Xintong (on behalf of the Flink PMC)
>
>
>
>
>
>
>


Re: [VOTE] FLIP-457: Improve Table/SQL Configuration for Flink 2.0

2024-05-27 Thread Jark Wu
+1 (binding)

Best,
Jark

On Mon, 27 May 2024 at 14:29, Hang Ruan  wrote:

> +1 (non-binding)
>
> Best,
> Hang
>
> gongzhongqiang  wrote on Mon, May 27, 2024 at 14:16:
>
> > +1 (non-binding)
> >
> > Best,
> > Zhongqiang Gong
> >
> > Jane Chan  wrote on Fri, May 24, 2024 at 09:52:
> >
> > > Hi all,
> > >
> > > I'd like to start a vote on FLIP-457[1] after reaching a consensus
> > through
> > > the discussion thread[2].
> > >
> > > The vote will be open for at least 72 hours unless there is an
> objection
> > or
> > > insufficient votes.
> > >
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=307136992
> > > [2] https://lists.apache.org/thread/1sthbv6q00sq52pp04n2p26d70w4fqj1
> > >
> > > Best,
> > > Jane
> > >
> >
>


Re: [VOTE] Apache Flink CDC Release 3.1.0, release candidate #3

2024-05-16 Thread Jark Wu
+1 (binding)

- checked signatures
- checked hashes
- checked release notes
- reviewed the release web PR
- checked the jars in the staging repo
- build and compile the source code locally with jdk8

Best,
Jark

On Wed, 15 May 2024 at 16:05, gongzhongqiang 
wrote:

> +1 (non-binding)
>
> - Verified signature and checksum hash
> - Verified that no binaries exist in the source archive
> - Build source code successful on ubuntu 22.04 with jdk8
> - Check tag and branch exist
> - Check jars are built by jdk8
>
> Best,
> Zhongqiang Gong
>
> Qingsheng Ren  wrote on Sat, May 11, 2024 at 10:10:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #3 for the version 3.1.0
> of
> > Apache Flink CDC, as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Flink CDC source release to be deployed to dist.apache.org
> > b) Maven artifacts to be deployed to the Maven Central Repository
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> follows,
> > for your review:
> > * All artifacts for a) can be found in the corresponding dev repository
> at
> > dist.apache.org [1], which are signed with the key with fingerprint
> > A1BD477F79D036D2C30CA7DBCA8AEEC2F6EB040B [2]
> > * All artifacts for b) can be found at the Apache Nexus Repository [3]
> >
> > Other links for your review:
> > * JIRA release notes [4]
> > * Source code tag "release-3.1.0-rc3" with commit hash
> > 5452f30b704942d0ede64ff3d4c8699d39c63863 [5]
> > * PR for release announcement blog post of Flink CDC 3.1.0 in flink-web
> [6]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours, adopted by majority
> > approval with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Qingsheng Ren
> >
> > [1] https://dist.apache.org/repos/dist/dev/flink/flink-cdc-3.1.0-rc3/
> > [2] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [3]
> https://repository.apache.org/content/repositories/orgapacheflink-1733
> > [4]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354387
> > [5] https://github.com/apache/flink-cdc/releases/tag/release-3.1.0-rc3
> > [6] https://github.com/apache/flink-web/pull/739
> >
>


Re: [DISCUSSION] FLIP-457: Improve Table/SQL Configuration for Flink 2.0

2024-05-16 Thread Jark Wu
Hi Jane,

Thanks for the proposal. +1 from my side.


Best,
Jark

On Thu, 16 May 2024 at 10:28, Xuannan Su  wrote:

> Hi Jane,
>
> Thanks for driving this effort! And +1 for the proposed changes.
>
> I have one comment on the migration plan.
>
> For options to be moved to another module/package, I think we have to
> mark the old option deprecated in 1.20 for it to be removed in 2.0,
> according to the API compatibility guarantees[1]. We can introduce the
> new option in 1.20 with the same option key in the intended class.
> WDYT?
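>
> For illustration, a minimal sketch of that migration pattern using Flink's
> ConfigOptions builder (the option key and class names here are made up):
> the new constant lives in the intended class but keeps the same key, so
> existing user configurations keep working while the old constant is
> marked @Deprecated.
>
> ```java
> import org.apache.flink.configuration.ConfigOption;
> import org.apache.flink.configuration.ConfigOptions;
>
> public class NewHomeOptions {
>     // Same key as the old (now @Deprecated) constant in the old module,
>     // so existing configurations resolve to this option unchanged.
>     public static final ConfigOption<Boolean> TABLE_EXEC_FOO =
>         ConfigOptions.key("table.exec.foo")
>             .booleanType()
>             .defaultValue(false)
>             .withDescription("Illustrative option moved between modules.");
> }
> ```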
>
> Best,
> Xuannan
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#api-compatibility-guarantees
>
>
>
> On Wed, May 15, 2024 at 6:20 PM Jane Chan  wrote:
> >
> > Hi all,
> >
> > I'd like to start a discussion on FLIP-457: Improve Table/SQL
> Configuration
> > for Flink 2.0 [1]. This FLIP revisited all Table/SQL configurations to
> > improve user-friendliness and maintainability as Flink moves toward 2.0.
> >
> > I am looking forward to your feedback.
> >
> > Best regards,
> > Jane
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=307136992
>


Re: Re: [VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-09 Thread Jark Wu
+1 (binding)

Best,
Jark

On Thu, 9 May 2024 at 21:27, Lincoln Lee  wrote:

> +1 (binding)
>
> Best,
> Lincoln Lee
>
>
> > Feng Jin  wrote on Thu, May 9, 2024 at 19:45:
>
> > +1 (non-binding)
> >
> >
> > Best,
> > Feng
> >
> >
> > On Thu, May 9, 2024 at 7:37 PM Xuyang  wrote:
> >
> > > +1 (non-binding)
> > >
> > >
> > > --
> > >
> > > Best!
> > > Xuyang
> > >
> > >
> > >
> > >
> > >
> > > At 2024-05-09 13:57:07, "Ron Liu"  wrote:
> > > >Sorry for the re-post, just to format this email content.
> > > >
> > > >Hi Dev
> > > >
> > > >Thank you to everyone for the feedback on FLIP-448: Introduce
> Pluggable
> > > >Workflow Scheduler Interface for Materialized Table[1][2].
> > > >I'd like to start a vote for it. The vote will be open for at least 72
> > > >hours unless there is an objection or not enough votes.
> > > >
> > > >[1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > > >
> > > >[2] https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> > > >
> > > >Best,
> > > >Ron
> > > >
> > > >Ron Liu  wrote on Thu, May 9, 2024 at 13:52:
> > > >
> > > >> Hi Dev, Thank you to everyone for the feedback on FLIP-448:
> Introduce
> > > >> Pluggable Workflow Scheduler Interface for Materialized Table[1][2].
> > I'd
> > > >> like to start a vote for it. The vote will be open for at least 72
> > hours
> > > >> unless there is an objection or not enough votes. [1]
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > > >>
> > > >> [2]
> https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> > > >> Best, Ron
> > > >>
> > >
> >
>


Re: [VOTE] FLIP-436: Introduce Catalog-related Syntax

2024-04-25 Thread Jark Wu
Thanks for driving this, Jane and Yubin.

+1. The new layout looks good to me.

Best,
Jark

On Fri, 26 Apr 2024 at 13:57, Jane Chan  wrote:

> Hi Yubin,
>
> Thanks for your effort. +1 with the display layout change (binding).
>
> Best,
> Jane
>
> On Wed, Apr 24, 2024 at 5:28 PM Ahmed Hamdy  wrote:
>
> > Hi, +1 (non-binding)
> > Best Regards
> > Ahmed Hamdy
> >
> >
> > On Wed, 24 Apr 2024 at 09:58, Yubin Li  wrote:
> >
> > > Hi everyone,
> > >
> > > During the implementation of the "describe catalog" syntax, it was
> > > found that the original output style needed to be improved.
> > > ```
> > > desc catalog extended cat2;
> > > +--------------------------+----------------------------------------------------------+
> > > | catalog_description_item |                                catalog_description_value |
> > > +--------------------------+----------------------------------------------------------+
> > > |                     Name |                                                     cat2 |
> > > |                     Type |                                        generic_in_memory |
> > > |                  Comment |                                                          |
> > > |               Properties | ('default-database','db'), ('type','generic_in_memory') |
> > > +--------------------------+----------------------------------------------------------+
> > > 4 rows in set
> > > ```
> > > After offline discussions with Jane Chan and Jark Wu, we suggest
> > > improving it to the following form:
> > > ```
> > > desc catalog extended cat2;
> > > +-------------------------+-------------------+
> > > |               info name |        info value |
> > > +-------------------------+-------------------+
> > > |                    name |              cat2 |
> > > |                    type | generic_in_memory |
> > > |                 comment |                   |
> > > | option:default-database |                db |
> > > +-------------------------+-------------------+
> > > 4 rows in set
> > > ```
> > >
> > > For the following reasons:
> > > 1. The title should be consistent with engines such as Databricks for
> > > easy understanding, and it should also be consistent with Flink's own
> > > naming style. Therefore, the title adopts "info name" and "info value",
> > > and the key names should be uniformly lowercase, so "Name" is replaced
> > > by "name".
> > > Note: Databricks output style [1] as follows:
> > > ```
> > > > DESCRIBE CATALOG main;
> > >  info_name     info_value
> > >  ------------  ---------------------------
> > >  Catalog Name  main
> > >  Comment       Main catalog (auto-created)
> > >  Owner         metastore-admin-users
> > >  Catalog Type  Regular
> > > ```
> > > 2. A catalog may have many properties, and readability is very poor
> > > when they are all displayed in one line. They should be expanded into
> > > multiple lines, with each key prefixed with "option:" to identify an
> > > attribute row. And since `type` is important information about the
> > > catalog, it should be displayed even if `extended` is not specified;
> > > correspondingly, "option:type" should be removed to avoid redundancy.
> > >
> > > WDYT? Looking forward to your reply!
> > >
> > > [1]
> > >
> >
> https://learn.microsoft.com/zh-tw/azure/databricks/sql/language-manual/sql-ref-syntax-aux-describe-catalog
> > >
> > > Best,
> > > Yubin
> > >
> > > On Wed, Mar 20, 2024 at 2:15 PM Benchao Li 
> wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > > gongzhongqiang  wrote on Wed, Mar 20, 2024 at 11:40:
> > > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > Best,
> > > > > Zhongqiang Gong
> > > > >
> > > > > > Yubin Li  wrote on Tue, Mar 19, 2024 at 18:03:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Thanks for all the feedback, I'd like to start a vote on the
> > > FLIP-436:
> > > > > > Introduce Catalog-related Syntax [1]. The discussion thread is
> here
> > > > > > [2].
> > > > > >
> > > > > > The vote will be open for at least 72 hours unless there is an
> > > > > > objection or insufficient votes.
> > > > > >
> > > > > > [1]
> > > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-436%3A+Introduce+Catalog-related+Syntax
> > > > > > [2]
> > https://lists.apache.org/thread/10k1bjb4sngyjwhmfqfky28lyoo7sv0z
> > > > > >
> > > > > > Best regards,
> > > > > > Yubin
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best,
> > > > Benchao Li
> > >
> >
>


Re: [VOTE] FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines

2024-04-17 Thread Jark Wu
+1 (binding)

Best,
Jark

On Wed, 17 Apr 2024 at 20:52, Leonard Xu  wrote:

> +1(binding)
>
> Best,
> Leonard
>
> > On Apr 17, 2024 at 8:31 PM, Lincoln Lee  wrote:
> >
> > +1(binding)
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Ferenc Csaky  wrote on Wed, Apr 17, 2024 at 19:58:
> >
> >> +1 (non-binding)
> >>
> >> Best,
> >> Ferenc
> >>
> >>
> >>
> >>
> >> On Wednesday, April 17th, 2024 at 10:26, Ahmed Hamdy <
> hamdy10...@gmail.com>
> >> wrote:
> >>
> >>>
> >>>
> >>> + 1 (non-binding)
> >>>
> >>> Best Regards
> >>> Ahmed Hamdy
> >>>
> >>>
> >>> On Wed, 17 Apr 2024 at 08:28, Yuepeng Pan panyuep...@apache.org wrote:
> >>>
>  +1(non-binding).
> 
>  Best,
>  Yuepeng Pan
> 
>  At 2024-04-17 14:27:27, "Ron liu" ron9@gmail.com wrote:
> 
> > Hi Dev,
> >
> > Thank you to everyone for the feedback on FLIP-435: Introduce a New
> > Materialized Table for Simplifying Data Pipelines[1][2].
> >
> > I'd like to start a vote for it. The vote will be open for at least
> >> 72
> > hours unless there is an objection or not enough votes.
> >
> > [1]
> 
> 
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
> 
> > [2] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
> >
> > Best,
> > Ron
> >>
>
>


[ANNOUNCE] New Apache Flink PMC Member - Jing Ge

2024-04-12 Thread Jark Wu
Hi everyone,

On behalf of the PMC, I'm very happy to announce that Jing Ge has
joined the Flink PMC!

Jing has been contributing to Apache Flink for a long time. He continuously
works on SQL, connectors, the Source and Sink APIs, and the testing and
documentation modules, contributing lots of code and insightful discussions.
He is one of the maintainers of the Flink CI infrastructure. He is also
willing to help a lot with community work, such as being the release manager
for both 1.18 and 1.19, verifying releases, and answering questions on the
mailing list. Besides that, he continuously helps with the expansion of the
Flink community and has given several talks about Flink at conferences such
as Flink Forward 2022 and 2023.

Congratulations and welcome Jing!

Best,
Jark (on behalf of the Flink PMC)


[ANNOUNCE] New Apache Flink PMC Member - Lincoln Lee

2024-04-12 Thread Jark Wu
Hi everyone,

On behalf of the PMC, I'm very happy to announce that Lincoln Lee has
joined the Flink PMC!

Lincoln has been an active member of the Apache Flink community for
many years. He mainly works on the Flink SQL component and has
driven/pushed many FLIPs around SQL, including FLIP-282/373/415/435 in
recent versions. He has a great technical vision of Flink SQL and has
participated in plenty of discussions on the dev mailing list. Besides
that, he is community-minded, e.g., being the release manager of 1.19,
verifying releases, managing release syncs, and writing the release
announcement.

Congratulations and welcome Lincoln!

Best,
Jark (on behalf of the Flink PMC)


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Jark Wu
> > > >>>> >> using the
> > > >>>> >> FRESHNESS keyword is very obscure for users.
> > > >>>> >> 2. It intrudes on the semantics of the CTAS syntax. Currently,
> > > tables
> > > >>>> >> created using CTAS only add table metadata to the Catalog and
> do
> > > not
> > > >>>> record
> > > >>>> >> attributes such as query. There are also no ongoing background
> > > >>>> refresh
> > > >>>> >> jobs, and the data writing operation happens only once at table
> > > >>>> creation.
> > > >>>> >> 3. For the framework, when we perform a certain kind of ALTER
> > > >>>> >> TABLE behavior on a table, how do we distinguish between tables
> > > >>>> >> created with FRESHNESS specified and tables created without it?
> > > >>>> >> This will also cause confusion.
> > > >>>> >>
> > > >>>> >> In terms of the design goal of combining Dynamic Table +
> > > >>>> >> Continuous Query, the FLIP proposal cannot be realized by only
> > > >>>> >> extending the current standard tables, so a new kind of dynamic
> > > >>>> >> table needs to be introduced as a first-level concept.
> > > >>>> >>
> > > >>>> >> [1]
> > > >>>> >>
> > > >>>>
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement
> > > >>>> >>
> > > >>>> >> Best,
> > > >>>> >> Ron
> > > >>>> >>
> > > >>>> >>  wrote on Wed, Apr 3, 2024 at 22:25:
> > > >>>> >>
> > > >>>> >>> Hello everybody!
> > > >>>> >>> Thanks for the FLIP as it looks amazing (and I think the proof
> > > >>>> >>> is this deep discussion it is provoking :))
> > > >>>> >>>
> > > >>>> >>> I have a couple of comments to add to this:
> > > >>>> >>>
> > > >>>> >>> Even though I get the reason why you rejected MATERIALIZED
> > > >>>> >>> VIEW, I still like it a lot, and I would like to provide
> > > >>>> >>> pointers on how the materialized view concept has twisted in
> > > >>>> >>> recent years:
> > > >>>> >>>
> > > >>>> >>> • Materialize DB (https://materialize.com/)
> > > >>>> >>> • The famous talk by Martin Kleppmann "turning the database
> > inside
> > > >>>> out" (
> > > >>>> >>> https://www.youtube.com/watch?v=fU9hR3kiOK0)
> > > >>>> >>>
> > > >>>> >>> I think the 2 above twisted the materialized view concept to
> > more
> > > >>>> than
> > > >>>> >>> just an optimization for accessing pre-computed
> > > aggregates/filters.
> > > >>>> >>> I think that concept (at least in my mind) is now more adherent
> > > >>>> >>> to the semantics of the words themselves ("materialized" and
> > > >>>> >>> "view") than to its implementations in DBMSs: just a view on raw
> > > >>>> >>> data that, hopefully, is constantly updated with fresh results.
> > > >>>> >>> That's why I understand Timo's et al. objections.
> > > >>>> >>> Still I understand there is no need to add confusion :)
> > > >>>> >>>
> > > >>>> >>> Still, I don't understand why we need another type of special
> > > table.
> > > >>>> >>> Could you dive deep into the reasons why not simply adding the
> > > >>>> FRESHNESS
> >

Re: [VOTE] FLIP-437: Support ML Models in Flink SQL

2024-04-02 Thread Jark Wu
+1 (binding)

Best,
Jark

On Tue, 2 Apr 2024 at 15:12, Timo Walther  wrote:

> +1 (binding)
>
> Thanks,
> Timo
>
> On 29.03.24 17:30, Hao Li wrote:
> > Hi devs,
> >
> > I'd like to start a vote on the FLIP-437: Support ML Models in Flink
> > SQL [1]. The discussion thread is here [2].
> >
> > The vote will be open for at least 72 hours unless there is an objection
> or
> > insufficient votes.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
> >
> > [2] https://lists.apache.org/thread/9z94m2bv4w265xb5l2mrnh4lf9m28ccn
> >
> > Thanks,
> > Hao
> >
>
>


Re: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-03-30 Thread Jark Wu
Congratulations!

Best,
Jark

On Fri, 29 Mar 2024 at 12:08, Yun Tang  wrote:

> Congratulations to all Paimon guys!
>
> Glad to see a Flink sub-project has been graduated to an Apache top-level
> project.
>
> Best
> Yun Tang
>
> 
> From: Hangxiang Yu 
> Sent: Friday, March 29, 2024 10:32
> To: dev@flink.apache.org 
> Subject: Re: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project
>
> Congratulations!
>
> On Fri, Mar 29, 2024 at 10:27 AM Benchao Li  wrote:
>
> > Congratulations!
> >
> > > Zakelly Lan  wrote on Fri, Mar 29, 2024 at 10:25:
> > >
> > > Congratulations!
> > >
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Thu, Mar 28, 2024 at 10:13 PM Jing Ge 
> > wrote:
> > >
> > > > Congrats!
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > On Thu, Mar 28, 2024 at 1:27 PM Feifan Wang 
> > wrote:
> > > >
> > > > > Congratulations!
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Feifan Wang
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > At 2024-03-28 20:02:43, "Yanfei Lei"  wrote:
> > > > > >Congratulations!
> > > > > >
> > > > > >Best,
> > > > > >Yanfei
> > > > > >
> > > > > >> Zhanghao Chen  wrote on Thu, Mar 28, 2024 at 19:59:
> > > > > >>
> > > > > >> Congratulations!
> > > > > >>
> > > > > >> Best,
> > > > > >> Zhanghao Chen
> > > > > >> 
> > > > > >> From: Yu Li 
> > > > > >> Sent: Thursday, March 28, 2024 15:55
> > > > > >> To: d...@paimon.apache.org 
> > > > > >> Cc: dev ; user 
> > > > > >> Subject: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level
> > > > Project
> > > > > >>
> > > > > >> CC the Flink user and dev mailing list.
> > > > > >>
> > > > > >> Paimon originated within the Flink community, initially known as
> > Flink
> > > > > >> Table Store, and all our incubating mentors are members of the
> > Flink
> > > > > >> Project Management Committee. I am confident that the bonds of
> > > > > >> enduring friendship and close collaboration will continue to
> > unite the
> > > > > >> two communities.
> > > > > >>
> > > > > >> And congratulations all!
> > > > > >>
> > > > > >> Best Regards,
> > > > > >> Yu
> > > > > >>
> > > > > >> On Wed, 27 Mar 2024 at 20:35, Guojun Li <
> gjli.schna...@gmail.com>
> > > > > wrote:
> > > > > >> >
> > > > > >> > Congratulations!
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Guojun
> > > > > >> >
> > > > > >> > On Wed, Mar 27, 2024 at 5:24 PM wulin 
> > wrote:
> > > > > >> >
> > > > > >> > > Congratulations~
> > > > > >> > >
> > > > > >> > > > On Mar 27, 2024 at 15:54, 王刚  wrote:
> > > > > >> > > >
> > > > > >> > > > Congratulations~
> > > > > >> > > >
> > > > > >> > > >> On Mar 26, 2024 at 10:25, Jingsong Li  wrote:
> > > > > >> > > >>
> > > > > >> > > >> Hi Paimon community,
> > > > > >> > > >>
> > > > > >> > > >> I’m glad to announce that the ASF board has approved a
> > > > > resolution to
> > > > > >> > > >> graduate Paimon into a full Top Level Project. Thanks to
> > > > > everyone for
> > > > > >> > > >> your help to get to this point.
> > > > > >> > > >>
> > > > > >> > > >> I just created an issue to track the things we need to
> > modify
> > > > > [2],
> > > > > >> > > >> please comment on it if you feel that something is
> > missing. You
> > > > > can
> > > > > >> > > >> refer to apache documentation [1] too.
> > > > > >> > > >>
> > > > > >> > > >> And, we already completed the GitHub repo migration [3],
> > please
> > > > > update
> > > > > >> > > >> your local git repo to track the new repo [4].
> > > > > >> > > >>
> > > > > >> > > >> You can run the following command to complete the remote
> > repo
> > > > > tracking
> > > > > >> > > >> migration.
> > > > > >> > > >>
> > > > > >> > > >> git remote set-url origin https://github.com/apache/paimon.git
> > > > > >> > > >>
> > > > > >> > > >> If you have a different name, please change the 'origin'
> to
> > > > your
> > > > > remote
> > > > > >> > > name.
> > > > > >> > > >>
> > > > > >> > > >> Please join me in celebrating!
> > > > > >> > > >>
> > > > > >> > > >> [1]
> > > > > >> > >
> > > > >
> > > >
> >
> https://incubator.apache.org/guides/transferring.html#life_after_graduation
> > > > > >> > > >> [2] https://github.com/apache/paimon/issues/3091
> > > > > >> > > >> [3] https://issues.apache.org/jira/browse/INFRA-25630
> > > > > >> > > >> [4] https://github.com/apache/paimon
> > > > > >> > > >>
> > > > > >> > > >> Best,
> > > > > >> > > >> Jingsong Lee
> > > > > >> > >
> > > > > >> > >
> > > > >
> > > >
> >
> >
> >
> > --
> >
> > Best,
> > Benchao Li
> >
>
>
> --
> Best,
> Hangxiang.
>


Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-28 Thread Jark Wu
Thanks, Hao,

Sounds good to me.

Best,
Jark

On Thu, 28 Mar 2024 at 01:02, Hao Li  wrote:

> Hi Jark,
>
> I think we can start with supporting popular model providers such as
> OpenAI, Azure ML, and SageMaker for remote models.
>
> Thanks,
> Hao
>
> On Tue, Mar 26, 2024 at 8:15 PM Jark Wu  wrote:
>
> > Thanks for the PoC and updating,
> >
> > The final syntax looks good to me, at least it is a nice and concise
> first
> > step.
> >
> > SELECT f1, f2, label FROM
> >ML_PREDICT(
> >  input => `my_data`,
> >  model => `my_cat`.`my_db`.`classifier_model`,
> >  args => DESCRIPTOR(f1, f2));
> >
> > Besides, what built-in models will we support in the FLIP? This might be
> > important
> > because it relates to what use cases can run with the new Flink version
> out
> > of the box.
> >
> > Best,
> > Jark
> >
> > On Wed, 27 Mar 2024 at 01:10, Hao Li  wrote:
> >
> > > Hi Timo,
> > >
> > > Yeah. For `primary key` and `from table(...)` those are explicitly
> > matched
> > > in parser: [1].
> > >
> > > > SELECT f1, f2, label FROM
> > >ML_PREDICT(
> > >  input => `my_data`,
> > >  model => `my_cat`.`my_db`.`classifier_model`,
> > >  args => DESCRIPTOR(f1, f2));
> > >
> > > This named argument syntax looks good to me. It can be supported
> together
> > > with
> > >
> > > SELECT f1, f2, label FROM ML_PREDICT(`my_data`,
> > > `my_cat`.`my_db`.`classifier_model`,DESCRIPTOR(f1, f2));
> > >
> > > Sure. Will let you know once updated the FLIP.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/confluentinc/flink/blob/release-1.18-confluent/flink-table/flink-sql-parser/src/main/codegen/includes/parserImpls.ftl#L814
> > >
> > > Thanks,
> > > Hao
> > >
> > > On Tue, Mar 26, 2024 at 4:15 AM Timo Walther 
> wrote:
> > >
> > > > Hi Hao,
> > > >
> > > >  > `TABLE(my_data)` and `MODEL(my_cat.my_db.classifier_model)`
> doesn't
> > > >  > work since `TABLE` and `MODEL` are already keywords
> > > >
> > > > This argument doesn't count. The parser supports introducing keywords
> > > > that are still non-reserved. For example, this enables using "key"
> for
> > > > both primary key and a column name:
> > > >
> > > > CREATE TABLE t (i INT PRIMARY KEY NOT ENFORCED)
> > > > WITH ('connector' = 'datagen');
> > > >
> > > > SELECT i AS key FROM t;
> > > >
> > > > I'm sure we will introduce `TABLE(my_data)` eventually as this is
> what
> > > > the standard dictates. But for now, let's use the most compact syntax
> > > > possible which is also in sync with Oracle.
> > > >
> > > > TLDR: We allow identifiers as arguments for PTFs, which are expanded
> > > > with catalog and database if necessary. Those identifier arguments
> > > > translate to catalog lookups for tables and models. The ML_ functions
> > > > will make sure that the arguments are of the correct type (model or
> > > > table).
> > > >
> > > > SELECT f1, f2, label FROM
> > > >ML_PREDICT(
> > > >  input => `my_data`,
> > > >  model => `my_cat`.`my_db`.`classifier_model`,
> > > >  args => DESCRIPTOR(f1, f2));
> > > >
> > > > So this will allow us to also use in the future:
> > > >
> > > > SELECT * FROM poly_func(table1);
> > > >
> > > > Same support as Oracle [1]. Very concise.
> > > >
> > > > Let me know when you updated the FLIP for a final review before
> voting.
> > > >
> > > > Do others have additional objections?
> > > >
> > > > Regards,
> > > > Timo
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://livesql.oracle.com/apex/livesql/file/content_HQK7TYEO0NHSJCDY3LN2ERDV6.html
> > > >
> > > >
> > > >
> > > > On 25.03.24 23:40, Hao Li wrote:
> > > > > Hi Timo,
> > > > >
> > > > >> Please double check if this is implementable with the current
> > stack. I
> > > > > fear the parser or validator might not like the "identifier"
> > a

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-26 Thread Jark Wu
> > >>> SELECT f1, f2, label FROM
> > >>> ML_PREDICT(
> > >>>   input => TABLE `my_data`,
> > >>>   model => my_cat.my_db.classifier_model,
> > >>>   args => DESCRIPTOR(f1, f2))
> > >>>
> > >>> Some feedback for the remainder of the FLIP:
> > >>>
> > >>> 1) Simplify catalog objects
> > >>>
> > >>> I would suggest to drop:
> > >>> CatalogModel.getModelKind()
> > >>> CatalogModel.getModelTask()
> > >>>
> > >>> A catalog object should fully resemble the DDL. And since the DDL
> puts
> > >>> those properties in the WITH clause, the catalog object should do the same
> same
> > >>> (i.e. put them into the `getModelOptions()`). Btw renaming this
> method
> > >>> to just `getOptions()` for consistency should be good as well.
> > >>> Internally, we can still provide enums for these frequently used
> > >>> classes. Similar to what we do in `FactoryUtil` for other frequently
> > >>> used options.
> > >>>
> > >>> Remove `getDescription()` and `getDetailedDescription()`. They were a
> > >>> mistake for CatalogTable and should actually be deprecated. They got
> > >>> replaced by `getComment()` which is sufficient.
> > >>>
> > >>> 2) CREATE TEMPORARY MODEL is not supported.
> > >>>
> > >>> This is an unnecessary restriction. We should support temporary
> > versions
> > >>> of these catalog objects as well for consistency. Adding support for
> > >>> this should be straightforward.
> > >>>
> > >>> 3) DESCRIBE | DESC } MODEL [catalog_name.][database_name.]model_name
> > >>>
> > >>> I would suggest we support `SHOW CREATE MODEL` instead. Similar to
> > `SHOW
> > >>> CREATE TABLE`, this should show all properties. If we support
> `DESCRIBE
> > >>> MODEL` it should only list the input parameters similar to `DESCRIBE
> > >>> TABLE` only shows the columns (not the WITH clause).
> > >>>
> > >>> Regards,
> > >>> Timo
> > >>>
> > >>>
> > >>> On 23.03.24 13:17, Ahmed Hamdy wrote:
> > >>>> Hi everyone,
> > >>>> +1 for this proposal, I believe it is very useful to the minimum, It
> > >>> would
> > >>>> be great even having  "ML_PREDICT" and "ML_EVALUATE" as built-in
> PTFs
> > in
> > >>>> this FLIP as discussed.
> > >>>> IIUC this will be included in the 1.20 roadmap?
> > >>>> Best Regards
> > >>>> Ahmed Hamdy
> > >>>>
> > >>>>
> > >>>> On Fri, 22 Mar 2024 at 23:54, Hao Li 
> > wrote:
> > >>>>
> > >>>>> Hi Timo and Jark,
> > >>>>>
> > >>>>> I agree Oracle's syntax seems concise and more descriptive. For the
> > >>>>> built-in `ML_PREDICT` and `ML_EVALUATE` functions I agree with Jark
> > we
> > >>> can
> > >>>>> support them as built-in PTF using `SqlTableFunction` for this
> FLIP.
> > >>> We can
> > >>>>> have a different FLIP discussing user defined PTF and adopt that
> > later
> > >>> for
> > >>>>> model functions later. To summarize, the current proposed syntax is
> > >>>>>
> > >>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> > >>>>> `classifier_model`, f1, f2))
> > >>>>>
> > >>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`,
> > `classifier_model`,
> > >>> f1,
> > >>>>> f2))
> > >>>>>
> > >>>>> Is `DESCRIPTOR` a must in the syntax? If so, it becomes
> > >>>>>
> > >>>>> SELECT f1, f2, label FROM TABLE(ML_PREDICT(TABLE `my_data`,
> > >>>>> `classifier_model`, DESCRIPTOR(f1), DESCRIPTOR(f2)))
> > >>>>>
> > >>>>> SELECT * FROM TABLE(ML_EVALUATE(TABLE `eval_data`,
> > `classifier_model`,
> > >>>>> DESCRIPTOR(f1), DESCRIPTOR(f2)))
> > >>>>>
> > >>>>> If Calcite supports dropping outer table keyword, it becomes
> > >>>>

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-22 Thread Jark Wu
Sorry, I meant we can bump the Calcite version if needed in Flink 1.20.

On Fri, 22 Mar 2024 at 22:19, Jark Wu  wrote:

> Hi Timo,
>
> Introducing user-defined PTF is very useful in Flink, I'm +1 for this.
> But I think the ML model FLIP is not blocked by this, because we
> can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
> just like TUMBLE/HOP, and support user-defined ML functions in
> a future FLIP.
>
> Regarding the simplified PTF syntax which reduces the outer TABLE()
> keyword,
> it seems it was just supported[1] by the Calcite community last month and
> will be
> released in the next version (v1.37). The Calcite community is preparing
> the
> 1.37 release, so we can bump the version if needed in Flink 1.19.
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/CALCITE-6254
>
> On Fri, 22 Mar 2024 at 21:46, Timo Walther  wrote:
>
>> Hi everyone,
>>
>> this is a very important change to the Flink SQL syntax but we can't
>> wait until the SQL standard is ready for this. So I'm +1 on introducing
>> the MODEL concept as a first class citizen in Flink.
>>
>> For your information: Over the past months I have already spent a
>> significant amount of time thinking about how we can introduce PTFs in
>> Flink. I reserved FLIP-440[1] for this purpose and I will share a
>> version of this in the next 1-2 weeks.
>>
>> For a good implementation of FLIP-440 and also FLIP-437, we should
>> evolve the PTF syntax in collaboration with Apache Calcite.
>>
>> There are different syntax versions out there:
>>
>> 1) Flink
>>
>> SELECT * FROM
>>TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>>
>> 2) SQL standard
>>
>> SELECT * FROM
>>TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>>
>> 3) Oracle
>>
>> SELECT * FROM
>> TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES));
>>
>> As you can see above, Flink does not follow the standard correctly as it
>> would need to use `TABLE()` but this is not provided by Calcite yet.
>>
>> I really like the Oracle syntax[2][3] a lot. It reduces necessary
>> keywords to a minimum. Personally, I would like to discuss this syntax
>> in a separate FLIP and hope I will find supporters for:
>>
>>
>> SELECT * FROM
>>TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
>>
>> If we go entirely with the Oracle syntax, as you can see in the example,
>> Oracle allows for passing identifiers directly. This would solve our
>> problems for the MODEL as well:
>>
>> SELECT f1, f2, label FROM ML_PREDICT(
>>data => `my_data`,
>>model => `classifier_model`,
>>input => DESCRIPTOR(f1, f2));
>>
>> Or we completely adopt the Oracle syntax:
>>
>> SELECT f1, f2, label FROM ML_PREDICT(
>>data => `my_data`,
>>model => `classifier_model`,
>>input => COLUMNS(f1, f2));
>>
>>
>> What do you think?
>>
>> Happy to create a FLIP for just this syntax question and collaborate
>> with the Calcite community on this. Supporting the syntax of Oracle
>> shouldn't be too hard to convince at least as parser parameter.
>>
>> Regards,
>> Timo
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
>> [2]
>>
>> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
>> [3] https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
>>
>>
>>
>> On 20.03.24 17:22, Mingge Deng wrote:
>> > Thanks Jark for all the insightful comments.
>> >
>> > We have updated the proposal per our offline discussions:
>> > 1. Model will be treated as a new relation in FlinkSQL.
>> > 2. Include the common ML predict and evaluate functions into the open
>> > source flink to complete the user journey.
>> >  And we should be able to extend the calcite SqlTableFunction to
>> support
>> > these two ML functions.
>> >
>> > Best,
>> > Mingge
>> >
>> > On Mon, Mar 18, 2024 at 7:05 PM Jark Wu  wrote:
>> >
>> >> Hi Hao,
>> >>
>> >>> I meant how the table name
>> >> in window TVF gets translated to `SqlCallingBinding`. Probably we need
>> to
>> >> fetch the table definition from the catalog somewh

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-22 Thread Jark Wu
Hi Timo,

Introducing user-defined PTF is very useful in Flink, I'm +1 for this.
But I think the ML model FLIP is not blocked by this, because we
can introduce ML_PREDICT and ML_EVALUATE as built-in PTFs
just like TUMBLE/HOP, and support user-defined ML functions in
a future FLIP.

Regarding the simplified PTF syntax which reduces the outer TABLE()
keyword,
it seems it was just supported[1] by the Calcite community last month and
will be
released in the next version (v1.37). The Calcite community is preparing
the
1.37 release, so we can bump the version if needed in Flink 1.19.

Best,
Jark

[1]: https://issues.apache.org/jira/browse/CALCITE-6254

On Fri, 22 Mar 2024 at 21:46, Timo Walther  wrote:

> Hi everyone,
>
> this is a very important change to the Flink SQL syntax but we can't
> wait until the SQL standard is ready for this. So I'm +1 on introducing
> the MODEL concept as a first class citizen in Flink.
>
> For your information: Over the past months I have already spent a
> significant amount of time thinking about how we can introduce PTFs in
> Flink. I reserved FLIP-440[1] for this purpose and I will share a
> version of this in the next 1-2 weeks.
>
> For a good implementation of FLIP-440 and also FLIP-437, we should
> evolve the PTF syntax in collaboration with Apache Calcite.
>
> There are different syntax versions out there:
>
> 1) Flink
>
> SELECT * FROM
>TABLE(TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>
> 2) SQL standard
>
> SELECT * FROM
>TABLE(TUMBLE(TABLE(Bid), DESCRIPTOR(bidtime), INTERVAL '10' MINUTES));
>
> 3) Oracle
>
> SELECT * FROM
> TUMBLE(Bid, COLUMNS(bidtime), INTERVAL '10' MINUTES));
>
> As you can see above, Flink does not follow the standard correctly as it
> would need to use `TABLE()` but this is not provided by Calcite yet.
>
> I really like the Oracle syntax[2][3] a lot. It reduces necessary
> keywords to a minimum. Personally, I would like to discuss this syntax
> in a separate FLIP and hope I will find supporters for:
>
>
> SELECT * FROM
>TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES);
>
> If we go entirely with the Oracle syntax, as you can see in the example,
> Oracle allows for passing identifiers directly. This would solve our
> problems for the MODEL as well:
>
> SELECT f1, f2, label FROM ML_PREDICT(
>data => `my_data`,
>model => `classifier_model`,
>input => DESCRIPTOR(f1, f2));
>
> Or we completely adopt the Oracle syntax:
>
> SELECT f1, f2, label FROM ML_PREDICT(
>data => `my_data`,
>model => `classifier_model`,
>input => COLUMNS(f1, f2));
>
>
> What do you think?
>
> Happy to create a FLIP for just this syntax question and collaborate
> with the Calcite community on this. Supporting the syntax of Oracle
> shouldn't be too hard to convince at least as parser parameter.
>
> Regards,
> Timo
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-440%3A+User-defined+Polymorphic+Table+Functions
> [2]
>
> https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_TF.html#GUID-0F66E239-DE77-4C0E-AC76-D5B632AB8072
> [3] https://oracle-base.com/articles/18c/polymorphic-table-functions-18c
>
>
>
> On 20.03.24 17:22, Mingge Deng wrote:
> > Thanks Jark for all the insightful comments.
> >
> > We have updated the proposal per our offline discussions:
> > 1. Model will be treated as a new relation in FlinkSQL.
> > 2. Include the common ML predict and evaluate functions into the open
> > source flink to complete the user journey.
> >  And we should be able to extend the calcite SqlTableFunction to
> support
> > these two ML functions.
> >
> > Best,
> > Mingge
> >
> > On Mon, Mar 18, 2024 at 7:05 PM Jark Wu  wrote:
> >
> >> Hi Hao,
> >>
> >>> I meant how the table name
> >> in window TVF gets translated to `SqlCallingBinding`. Probably we need
> to
> >> fetch the table definition from the catalog somewhere. Do we treat those
> >> window TVF specially in parser/planner so that catalog is looked up when
> >> they are seen?
> >>
> >> The table names are resolved and validated by Calcite SqlValidator.  We
> >> don' need to fetch from catalog manually.
> >> The specific checking logic of cumulate window happens in
> >> SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
> >> The return type of SqlCumulateTableFunction is defined in
> >> #getRowTypeInference() method.
> >> Both are public interfaces provided by Calcite

Re: [DISCUSS] Planning Flink 1.20

2024-03-21 Thread Jark Wu
Thanks for kicking this off.

+1 for the volunteering release managers (Weijie Guo, Rui Fan) and the
target date (feature freeze: June 15).

Best,
Jark





On Fri, 22 Mar 2024 at 14:00, Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Leonard for this feedback and help!
>
> Best,
> Rui
>
> On Fri, Mar 22, 2024 at 12:36 PM weijie guo 
> wrote:
>
> > Thanks Leonard!
> >
> > > I'd like to help you if you need some help like permissions from PMC
> > side, please feel free to ping me.
> >
> > Nice to know. It'll help a lot!
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Leonard Xu  wrote on Fri, Mar 22, 2024 at 12:09:
> >
>> +1 for the proposed release managers (Weijie Guo, Rui Fan); both
>> candidates are pretty active committers, thus I believe they know the
>> community development process well. The recent releases have had four
>> release managers, and I am also looking forward to having other
>> volunteers join the management of Flink 1.20.
> >>
>> +1 for the target date (feature freeze: June 15, 2024). Referring to the
>> release cycle of recent versions, a release cycle of 4 months makes sense
>> to me.
> >>
> >>
> >> I'd like to help you if you need some help like permissions from PMC
> >> side, please feel free to ping me.
> >>
> >> Best,
> >> Leonard
> >>
> >>
> >> > On Mar 19, 2024 at 5:35 PM, Rui Fan <1996fan...@gmail.com> wrote:
> >> >
> >> > Hi Weijie,
> >> >
> >> > Thanks for kicking off 1.20! I'd like to join you and participate in
> the
> >> > 1.20 release.
> >> >
> >> > Best,
> >> > Rui
> >> >
> >> > On Tue, Mar 19, 2024 at 5:30 PM weijie guo  >
> >> > wrote:
> >> >
> >> >> Hi everyone,
> >> >>
> >> >> With the release announcement of Flink 1.19, it's a good time to kick
> >> off
> >> >> discussion of the next release 1.20.
> >> >>
> >> >>
> >> >> - Release managers
> >> >>
> >> >>
> >> >> I'd like to volunteer as one of the release managers this time. It
> has
> >> been
> >> >> good practice to have a team of release managers from different
> >> >> backgrounds, so please raise your hand if you'd like to volunteer and
> >> get
> >> >> involved.
> >> >>
> >> >>
> >> >>
> >> >> - Timeline
> >> >>
> >> >>
> >> >> Flink 1.19 has been released. With a target release cycle of 4
> months,
> >> >> we propose a feature freeze date of *June 15, 2024*.
> >> >>
> >> >>
> >> >>
> >> >> - Collecting features
> >> >>
> >> >>
> >> >> As usual, we've created a wiki page[1] for collecting new features in
> >> 1.20.
> >> >>
> >> >>
> >> >> In addition, we already have a number of FLIPs that have been voted
> or
> >> are
> >> >> in the process, including pre-works for version 2.0.
> >> >>
> >> >>
> >> >> In the meantime, the release management team will be finalized in the
> >> next
> >> >> few days, and we'll continue to create Jira Boards and Sync meetings
> >> >> to make it easy
> >> >> for everyone to get an overview and track progress.
> >> >>
> >> >>
> >> >>
> >> >> Best regards,
> >> >>
> >> >> Weijie
> >> >>
> >> >>
> >> >>
> >> >> [1] https://cwiki.apache.org/confluence/display/FLINK/1.20+Release
> >> >>
> >>
> >>
>


Re: [ANNOUNCE] Donation Flink CDC into Apache Flink has Completed

2024-03-20 Thread Jark Wu
Congratulations and welcome!

Best,
Jark

On Thu, 21 Mar 2024 at 10:35, Rui Fan <1996fan...@gmail.com> wrote:

> Congratulations!
>
> Best,
> Rui
>
> On Thu, Mar 21, 2024 at 10:25 AM Hang Ruan  wrote:
>
> > Congratulations!
> >
> > Best,
> > Hang
> >
> > Lincoln Lee  wrote on Thu, Mar 21, 2024 at 09:54:
> >
> >>
> >> Congrats, thanks for the great work!
> >>
> >>
> >> Best,
> >> Lincoln Lee
> >>
> >>
> >>> Peter Huang  wrote on Wed, Mar 20, 2024 at 22:48:
> >>
> >>> Congratulations
> >>>
> >>>
> >>> Best Regards
> >>> Peter Huang
> >>>
> >>> On Wed, Mar 20, 2024 at 6:56 AM Huajie Wang 
> wrote:
> >>>
> 
>  Congratulations
> 
> 
> 
>  Best,
>  Huajie Wang
> 
> 
> 
>  Leonard Xu  wrote on Wed, Mar 20, 2024 at 21:36:
> 
> > Hi devs and users,
> >
> > We are thrilled to announce that the donation of Flink CDC as a
> > sub-project of Apache Flink has completed. We invite you to explore
> the new
> > resources available:
> >
> > - GitHub Repository: https://github.com/apache/flink-cdc
> > - Flink CDC Documentation:
> > https://nightlies.apache.org/flink/flink-cdc-docs-stable
> >
> > After the Flink community accepted this donation [1], we have completed
> > software copyright signing, code repo migration, code cleanup, website
> > migration, CI migration, GitHub issues migration, etc.
> > Here I am particularly grateful to Hang Ruan, Zhongqiang Gong,
> > Qingsheng Ren, Jiabao Sun, LvYanquan, loserwang1024 and other
> > contributors for their contributions and help during this process!
> >
> >
> > For all previous contributors: The contribution process has slightly
> > changed to align with the main Flink project. To report bugs or
> suggest new
> > features, please open tickets in
> > Apache Jira (https://issues.apache.org/jira). Note that we will no
> > longer accept GitHub issues for these purposes.
> >
> >
> > Welcome to explore the new repository and documentation. Your
> feedback
> > and contributions are invaluable as we continue to improve Flink CDC.
> >
> > Thanks everyone for your support and happy exploring Flink CDC!
> >
> > Best,
> > Leonard
> > [1] https://lists.apache.org/thread/cw29fhsp99243yfo95xrkw82s5s418ob
> >
> >
>


Re: [VOTE] FLIP-436: Introduce Catalog-related Syntax

2024-03-19 Thread Jark Wu
+1 (binding)

Best,
Jark

On Tue, 19 Mar 2024 at 19:05, Yuepeng Pan  wrote:

> Hi, Yubin
>
>
> Thanks for driving it !
>
> +1 non-binding.
>
>
>
>
>
>
>
> Best,
> Yuepeng Pan.
>
>
>
>
>
>
>
>
> At 2024-03-19 17:56:42, "Yubin Li"  wrote:
> >Hi everyone,
> >
> >Thanks for all the feedback, I'd like to start a vote on the FLIP-436:
> >Introduce Catalog-related Syntax [1]. The discussion thread is here
> >[2].
> >
> >The vote will be open for at least 72 hours unless there is an
> >objection or insufficient votes.
> >
> >[1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-436%3A+Introduce+Catalog-related+Syntax
> >[2] https://lists.apache.org/thread/10k1bjb4sngyjwhmfqfky28lyoo7sv0z
> >
> >Best regards,
> >Yubin
>


Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-18 Thread Jark Wu
Hi Hao,

> I meant how the table name
in window TVF gets translated to `SqlCallingBinding`. Probably we need to
fetch the table definition from the catalog somewhere. Do we treat those
window TVF specially in parser/planner so that catalog is looked up when
they are seen?

The table names are resolved and validated by the Calcite SqlValidator. We
don't need to fetch from the catalog manually.
The specific checking logic of the cumulate window happens in
SqlCumulateTableFunction.OperandMetadataImpl#checkOperandTypes.
The return type of SqlCumulateTableFunction is defined in the
#getRowTypeInference() method.
Both are public interfaces provided by Calcite, and it seems this is not
specially handled in the parser/planner.

I didn't try that, but my gut feeling is that the framework is ready to be
extended with a customized TVF.
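
In case it helps, a rough, untested sketch of what such a customized
built-in PTF could look like, following the SqlCumulateTableFunction
pattern. The function name and the row-type derivation are made up; only
the Calcite types used (SqlFunction, SqlTableFunction, SqlOperandMetadata,
ReturnTypes) are real public API.

```java
import org.apache.calcite.sql.SqlFunction;
import org.apache.calcite.sql.SqlFunctionCategory;
import org.apache.calcite.sql.SqlKind;
import org.apache.calcite.sql.SqlOperandMetadata;
import org.apache.calcite.sql.SqlTableFunction;
import org.apache.calcite.sql.type.ReturnTypes;
import org.apache.calcite.sql.type.SqlReturnTypeInference;

public class MyPredictTableFunction extends SqlFunction implements SqlTableFunction {

    public MyPredictTableFunction(SqlOperandMetadata operandMetadata) {
        // operandMetadata plays the role of OperandMetadataImpl above:
        // its checkOperandTypes() validates the call's arguments.
        super("MY_PREDICT", SqlKind.OTHER_FUNCTION, ReturnTypes.CURSOR,
              null, operandMetadata, SqlFunctionCategory.SYSTEM);
    }

    @Override
    public SqlReturnTypeInference getRowTypeInference() {
        // Derive the output row type; here simply the input table's row type.
        return opBinding -> opBinding.getOperandType(0);
    }
}
```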

> For what model is, I'm wondering if it has to be datatype or relation. Can
it be another kind of citizen parallel to datatype/relation/function/db?
Redshift also supports `show models` operation, so it seems it's treated
specially as well?

If it is an entity only used in catalog scope (e.g., show xxx, create xxx,
drop xxx), it is fine to introduce it.
We have introduced one such entity before, called Module: "load module",
"show modules" [1].
But if we want to use Model in TVF parameters, it means it has to be a
relation or datatype, because those are the only things TVF parameters
accept now.

Thanks for sharing the reasons for preferring TVF over the Redshift way.
They sound reasonable to me.

Best,
Jark

 [1]:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/modules/

On Fri, 15 Mar 2024 at 13:41, Hao Li  wrote:

> Hi Jark,
>
> Thanks for the pointer. Sorry for the confusion: I meant how the table name
> in window TVF gets translated to `SqlCallingBinding`. Probably we need to
> fetch the table definition from the catalog somewhere. Do we treat those
> window TVF specially in parser/planner so that catalog is looked up when
> they are seen?
>
> For what model is, I'm wondering if it has to be datatype or relation. Can
> it be another kind of citizen parallel to datatype/relation/function/db?
> Redshift also supports `show models` operation, so it seems it's treated
> specially as well? The reasons I don't like Redshift's syntax are:
> 1. It's a bit verbose, you need to think of a model name as well as a
> function name and the function name also needs to be unique.
> 2. More importantly, prediction function isn't the only function that can
> operate on models. There could be a set of inference functions [1] and
> evaluation functions [2] which can operate on models. It's hard to specify
> all of them in model creation.
>
> [1]:
>
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict
> [2]:
>
> https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
>
> Thanks,
> Hao
>
> On Thu, Mar 14, 2024 at 8:18 PM Jark Wu  wrote:
>
> > Hi Hao,
> >
> > > Can you send me some pointers
> > where the function gets the table information?
> >
> > Here is the code of cumulate window type checking [1].
> >
> > > Also is it possible to support  in
> > window functions in addition to table?
> >
> > Yes. It is not allowed in TVF.
> >
> > Thanks for the syntax links of other systems. The reason I prefer the
> > Redshift way is
> > that it avoids introducing Model as a relation or datatype (referenced
> as a
> > parameter in TVF).
> > Model is not a relation because it can be queried directly (e.g., SELECT
> *
> > FROM model).
> > I'm also confused about making Model as a datatype, because I don't know
> > what class the
> > model parameter of the eval method of TableFunction/ScalarFunction should
> > be. By defining
> > the function with the model, users can directly invoke the function
> without
> > reference to the model name.
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> >
> https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53
> >
> > On Fri, 15 Mar 2024 at 02:48, Hao Li  wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for the pointers. It's very helpful.
> > >
> > > 1. Looks like `tumble`, `hopping` are keywords in calcite parser. And
> the
> > > syntax `cumulate(Table my_table, ...)` needs to get table information
> > from
> > > catalog somewhere for type validation etc. Can you send me some
> pointers
> > > where the function gets the table information?

Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Jark Wu
Congrats! Thanks Lincoln, Jing, Yun and Martijn driving this release.
Thanks all who involved this release!

Best,
Jark


On Mon, 18 Mar 2024 at 16:31, Rui Fan <1996fan...@gmail.com> wrote:

> Congratulations, thanks for the great work!
>
> Best,
> Rui
>
> On Mon, Mar 18, 2024 at 4:26 PM Lincoln Lee 
> wrote:
>
> > The Apache Flink community is very happy to announce the release of
> Apache
> > Flink 1.19.0, which is the first release for the Apache Flink 1.19
> series.
> >
> > Apache Flink® is an open-source stream processing framework for
> > distributed, high-performing, always-available, and accurate data
> streaming
> > applications.
> >
> > The release is available for download at:
> > https://flink.apache.org/downloads.html
> >
> > Please check out the release blog post for an overview of the
> improvements
> > for this release:
> >
> >
> https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19/
> >
> > The full release notes are available in Jira:
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> >
> > We would like to thank all contributors of the Apache Flink community who
> > made this release possible!
> >
> >
> > Best,
> > Yun, Jing, Martijn and Lincoln
> >
>


Re: Re: [DISCUSS] FLIP-436: Introduce "SHOW CREATE CATALOG" Syntax

2024-03-18 Thread Jark Wu
Hi Yubin,

Thanks for the quick response. The suggestion sounds good to me!
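
To make it concrete, a sketch of the two statements we agreed on (the
catalog name and property key are illustrative):

```
-- print full metadata, including properties
DESCRIBE CATALOG EXTENDED cat1;

-- reset a property back to its default value
ALTER CATALOG cat1 RESET ('mykey');
```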

Best,
Jark

On Mon, 18 Mar 2024 at 13:06, Yubin Li  wrote:

> Hi Jark,
>
> Good pointing! Thanks for your reply, there are some details to align :)
>
> 1. I think the purpose of DESCRIBE CATALOG is to display metadata
> > information including catalog name,
> > catalog comment (may be introduced in the future), catalog type, and
> > catalog properties (for example [1])
>
> Adopting { DESC | DESCRIBE } CATALOG [ EXTENDED ] xx as the formal syntax,
> producing rich and forward-compatible results for future needs is very
> important. When "EXTENDED" is specified in the syntax, it will output the
> complete information, including properties. The complete output example is
> as follows:
>
> +--------------------------+---------------------------+
> | catalog_description_item | catalog_description_value |
> +--------------------------+---------------------------+
> | Name                     | cat1                      |
> | Type                     | generic_in_memory         |
> | Comment                  |                           |
> | Properties               | ((k1,v1), (k2,v2))        |
> +--------------------------+---------------------------+
>
> 2. Could you add support for ALTER CATALOG xxx UNSET ('mykey')? This is
> > also very useful in ALTER TABLE.
>
> I found that there is already an ALTER TABLE xxx RESET ('mykey') syntax [1]
> now,
> which will reset the myKey attribute of a certain table to the default
> value. For catalogs,
> it might be better to use ALTER CATALOG xxx RESET ('mykey') for the sake of
> design
> consistency.
>
> WDYT? Looking forward to your suggestions.
>
> Best,
> Yubin
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/alter/#reset
>
>
> On Mon, Mar 18, 2024 at 11:49 AM Jark Wu  wrote:
>
> > Hi Yubin,
> >
> > Thanks for updating the FLIP. The updated version looks good in general.
> > I only have 2 minor comments.
> >
> > 1. I think the purpose of DESCRIBE CATALOG is to display metadata
> > information including catalog name,
> > catalog comment (may be introduced in the future), catalog type, and
> > catalog properties (for example [1]).
> > Expanding all properties may limit this syntax to include more metadata
> > information in the future.
> >
> > 2. Could you add support for ALTER CATALOG xxx UNSET ('mykey')? This is
> > also very useful in ALTER TABLE.
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> >
> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-describe-schema.html
> >
> >
> >
> > On Fri, 15 Mar 2024 at 12:06, Yubin Li  wrote:
> >
> > > Hi Xuyang,
> > >
> > > Thank you for pointing this out, The parser part of `describe catalog`
> > > syntax
> > > has indeed been implemented in FLIP-69, and it is not actually
> available.
> > > we can complete the syntax in this FLIP [1].  I have updated the doc :)
> > >
> > > Best,
> > > Yubin
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-436%3A+Introduce+Catalog-related+Syntax
> > >
> > > On Fri, Mar 15, 2024 at 10:12 AM Xuyang  wrote:
> > >
> > > > Hi, Yubin. Big +1 for this Flip. I just left one minor comment
> > following.
> > > >
> > > >
> > > > I found that although flink has not supported syntax 'DESCRIBE
> CATALOG
> > > > catalog_name' currently, it was already
> > > > discussed in flip-69[1], do we need to restart discussing it?
> > > > I don't have a particular preference regarding the restart
> discussion.
> > It
> > > > seems that there is no difference on this syntax
> > > > in FLIP-436, so maybe it would be best to refer back to FLIP-69 in
> this
> > > > FLIP. WDYT?
> > > >
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-69%3A+Flink+SQL+DDL+Enhancement
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best!
> > > > Xuyang
> > > >
> > > >
> >

Re: Re: [DISCUSS] FLIP-436: Introduce "SHOW CREATE CATALOG" Syntax

2024-03-17 Thread Jark Wu
+1 for the proposal.
> > >> >>
> > >> >> I also think it makes sense to group the missing catalog related
> > >> >> SQL syntaxes under this FLIP.
> > >> >>
> > >> >> Looking forward to these features!
> > >> >>
> > >> >> Best,
> > >> >> Ferenc
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Thursday, March 14th, 2024 at 08:31, Jane Chan <
> > >> qingyue@gmail.com> wrote:
> > >> >>
> > >> >>>
> > >> >>>
> > >> >>> Hi Yubin,
> > >> >>>
> > >> >>> Thanks for leading the discussion. I'm +1 for the FLIP.
> > >> >>>
> > >> >>> As Jark said, it's a good opportunity to enhance the syntax for
> > Catalog
> > >> >>> from a more comprehensive perspective. So, I suggest expanding the
> > >> scope of
> > >> >>> this FLIP by focusing on the mechanism instead of one use case to
> > >> enhance
> > >> >>> the overall functionality. WDYT?
> > >> >>>
> > >> >>> Best,
> > >> >>> Jane
> > >> >>>
> > >> >>> On Thu, Mar 14, 2024 at 11:38 AM Hang Ruan ruanhang1...@gmail.com
> > >> wrote:
> > >> >>>
> > >> >>>> Hi, Yubin.
> > >> >>>>
> > >> >>>> Thanks for the FLIP. +1 for it.
> > >> >>>>
> > >> >>>> Best,
> > >> >>>> Hang
> > >> >>>>
> > >> >>>>> Yubin Li lyb5...@gmail.com wrote on Thu, 14 Mar 2024 at 10:15:
> > >> >>>>
> > >> >>>>> Hi Jingsong, Feng, and Jeyhun
> > >> >>>>>
> > >> >>>>> Thanks for your support and feedback!
> > >> >>>>>
> > >> >>>>>> However, could we add a new method `getCatalogDescriptor()` to
> > >> >>>>>> CatalogManager instead of directly exposing CatalogStore?
> > >> >>>>>
> > >> >>>>> Good point, Besides the audit tracking issue, The proposed
> feature
> > >> >>>>> only requires `getCatalogDescriptor()` function. Exposing
> > components
> > >> >>>>> with excessive functionality will bring unnecessary risks, I
> have
> > >> made
> > >> >>>>> modifications in the FLIP doc [1]. Thank Feng :)
> > >> >>>>>
> > >> >>>>>> Showing the SQL parser implementation in the FLIP for the SQL
> > syntax
> > >> >>>>>> might be a bit confusing. Also, the formal definition is
> missing
> > for
> > >> >>>>>> this SQL clause.
> > >> >>>>>
> > >> >>>>> Thank Jeyhun for pointing it out :) I have updated the doc [1] .
> > >> >>>>>
> > >> >>>>> [1]
> > >> >>>>
> > >> >>>>
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=296290756
> > >> >>>>
> > >> >>>>> Best,
> > >> >>>>> Yubin
> > >> >>>>>
> > >> >>>>> On Thu, Mar 14, 2024 at 2:18 AM Jeyhun Karimov
> > je.kari...@gmail.com
> > >> >>>>> wrote:
> > >> >>>>>
> > >> >>>>>> Hi Yubin,
> > >> >>>>>>
> > >> >>>>>> Thanks for the proposal. +1 for it.
> > >> >>>>>> I have one comment:
> > >> >>>>>>
> > >> >>>>>> I would like to see the SQL syntax for the proposed statement.
> > >> Showing
> > >> >>>>>> the
> > >> >>>>>> SQL parser implementation in the FLIP
> > >> >>>>>> for the SQL syntax might be a bit confusing. Also, the formal
> > >> >>>>>> definition
> > >> >>>>>> is
> > >> >>>>>> missing for this SQL clause.

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-14 Thread Jark Wu
Hi Hao,

> Can you send me some pointers
where the function gets the table information?

Here is the code of cumulate window type checking [1].
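
For context, a cumulate window TVF call looks like this in SQL (example
adapted from the Flink docs; the `TABLE Bid` argument is the operand whose
row type the checker in [1] validates):

```
SELECT window_start, window_end, SUM(price)
FROM TABLE(
  CUMULATE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '2' MINUTES, INTERVAL '10' MINUTES))
GROUP BY window_start, window_end;
```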

> Also is it possible to support <query_stmt> in
window functions in addition to table?

Yes, that should be possible, but it is not allowed in TVF for now.

Thanks for the syntax links of other systems. The reason I prefer the
Redshift way is
that it avoids introducing Model as a relation or datatype (referenced as a
parameter in TVF).
Model is not a relation because it cannot be queried directly (e.g., SELECT *
FROM model).
I'm also confused about making Model a datatype, because I don't know what
class the model parameter of the eval method of TableFunction/ScalarFunction
should be. By defining the function together with the model, users can
directly invoke the function without referencing the model name.

Best,
Jark

[1]:
https://github.com/apache/flink/blob/d6c7eee8243b4fe3e593698f250643534dc79cb5/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/functions/sql/SqlCumulateTableFunction.java#L53

On Fri, 15 Mar 2024 at 02:48, Hao Li  wrote:

> Hi Jark,
>
> Thanks for the pointers. It's very helpful.
>
> 1. Looks like `tumble`, `hopping` are keywords in calcite parser. And the
> syntax `cumulate(Table my_table, ...)` needs to get table information from
> catalog somewhere for type validation etc. Can you send me some pointers
> where the function gets the table information?
> 2. The ideal syntax for the model function I think would be `ML_PREDICT(MODEL
> <model_name>, {TABLE <table_name> | (query_stmt)})`. I think with special
> handling of the `ML_PREDICT` function in the parser/planner, maybe we can do
> this like window functions. But to support the `MODEL` keyword, we need a
> Calcite parser change I guess. Also is it possible to support <query_stmt> in
> window functions in addition to table?
>
> For the redshift syntax, I'm not sure the purpose of defining the function
> name with the model. Is it to define the function input/output schema? We
> have the schema in our create model syntax and the `ML_PREDICT` can handle
> it by getting model definition. I think our syntax is more concise to have
> a generic prediction function. I also did some research and it's the syntax
> used by Databricks `ai_query` [1], Snowflake `predict` [2], Azureml
> `predict` [3].
>
> [1]:
> https://docs.databricks.com/en/sql/language-manual/functions/ai_query.html
> [2]:
>
> https://github.com/Snowflake-Labs/sfguide-intro-to-machine-learning-with-snowpark-ml-for-python/blob/main/3_snowpark_ml_model_training_inference.ipynb?_fsi=sksXUwQ0
> [3]:
>
> https://learn.microsoft.com/en-us/sql/machine-learning/tutorials/quickstart-python-train-score-model?view=azuresqldb-mi-current
>
> Thanks,
> Hao
>
> On Wed, Mar 13, 2024 at 8:57 PM Jark Wu  wrote:
>
> > Hi Mingge, Hao,
> >
> > Thanks for your replies.
> >
> > > PTF is actually the ideal approach for model functions, and we do have
> > the plans to use PTF for
> > all model functions (including prediction, evaluation etc..) once the PTF
> > is supported in FlinkSQL
> > confluent extension.
> >
> > It sounds that PTF is the ideal way and table function is a temporary
> > solution which will be dropped in the future.
> > I'm not sure whether we can implement it using PTF in Flink SQL. But we
> > have implemented window
> > functions using PTF[1]. And introduced a new window function (called
> > CUMULATE[2]) in Flink SQL based
> > on this. I think it might work to use PTF and implement model function
> > syntax like this:
> >
> > SELECT * FROM TABLE(ML_PREDICT(
> >   TABLE my_table,
> >   my_model,
> >   col1,
> >   col2
> > ));
> >
> > Besides, did you consider following the way of AWS Redshift which defines
> > model function with the model itself together?
> > IIUC, a model is a black-box which defines input parameters and output
> > parameters which can be modeled into functions.
> >
> >
> > Best,
> > Jark
> >
> > [1]:
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
> > [2]:
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
> > [3]:
> >
> >
> https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model
> >
> >
> >
> >
> > On Wed, 13 Mar 2024 at 15:00, Hao Li  wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for your questions. These are good questions!

Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-14 Thread Jark Wu
The pull request has been merged. Thank you for the discussion and
reviewing.
The page is live now: https://flink.apache.org/what-is-flink/special-thanks/

Best,
Jark

On Tue, 12 Mar 2024 at 17:44, Jark Wu  wrote:

> I have created a JIRA issue and opened a pull request for this:
> https://github.com/apache/flink-web/pull/725.
>
> Best,
> Jark
>
> On Tue, 12 Mar 2024 at 16:56, Jark Wu  wrote:
>
>> Thank you all for your feedback. If there are no other concerns or
>> objections,
>> I'm going to create a pull request to add the Special Thanks page.
>>
>> Further feedback and sponsors to be added are still welcome!
>>
>> Best,
>> Jark
>>
>> On Mon, 11 Mar 2024 at 23:09, Maximilian Michels  wrote:
>>
>>> Hi Jark,
>>>
>>> Thanks for clarifying. At first sight, such a page indicated general
>>> sponsorship. +1 for a Thank You page to list specific monetary
>>> contributions to the project for resources which are actively used or
>>> were actively used in the past.
>>>
>>> Cheers,
>>> Max
>>>
>>> On Fri, Mar 8, 2024 at 11:55 AM Martijn Visser 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I'm +1 on it. As long as we follow the ASF rules on this, we can thank
>>> > those that are/have made contributions.
>>> >
>>> > Best regards,
>>> >
>>> > Martijn
>>> >
>>> > On Wed, Mar 6, 2024 at 7:45 AM Jark Wu  wrote:
>>> >
>>> > > Hi Matthias,
>>> > >
>>> > > Thanks for your comments! Please see my reply inline.
>>> > >
>>> > > > What do we do if we have enough VMs? Do we still allow
>>> > > companies to add more VMs to the pool even though it's not adding any
>>> > > value?
>>> > >
>>> > > The ASF policy[1] makes it very clear: "Project Thanks pages are to
>>> show
>>> > > appreciation
>>> > > for goods that the project truly needs, not just for goods that
>>> someone
>>> > > wants to donate."
>>> > > Therefore, the community should reject new VMs if it is enough.
>>> > >
>>> > >
>>> > > > The community lacks the openly accessible tools to monitor the VM
>>> usage
>>> > > independently
>>> > > as far as I know (the Azure Pipelines project is owned by Ververica
>>> right
>>> > > now).
>>> > >
>>> > > The Azure pipeline account is sponsored by Ververica, and is managed
>>> by the
>>> > > community.
>>> > > AFAIK, Chesnay and Robert both have admin permissions [2] to the
>>> Azure
>>> > > pipeline project.
>>> > > Others can contact the managers to get access to the environment.
>>> > >
>>> > > > I figured that there could be a chance for us to
>>> > > rely on Apache-provided infrastructure entirely with our current
>>> workload
>>> > > when switching over from Azure Pipelines.
>>> > >
>>> > > That sounds great. We can return back the VMs and mark the donations
>>> as
>>> > > historical
>>> > > on the Thank Page once the new GitHub Actions CI is ready.
>>> > >
>>> > > > I am fine with creating a Thank You page to acknowledge the
>>> financial
>>> > > contributions from Alibaba and Ververica in the past (since Apache
>>> allows
>>> > > historical donations) considering that the contributions of the two
>>> > > companies go way back in time and are quite significant in my
>>> opinion. I
>>> > > suggest focusing on the past for now because of the option to
>>> migrate to
>>> > > Apache infrastructure midterm.
>>> > >
>>> > > Sorry, do you mean we only mention past donations for now?
>>> > > IIUC, the new GitHub Actions might be ready after the end of v1.20,
>>> which
> >>> > > will probably be in half a year.
>>> > > I'm worried that if we say the sponsorship is ongoing until now (but
>>> it's
>>> > > not), it will confuse
>>> > > people and disrespect the sponsor.
>>> > >
>>> > > Besides, I'm not sure whether the new GitHub Actions CI will replace
>>> the
> >>> > > machines for running flink-ci mirrors [3] and the flink benchmarks [4].

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-13 Thread Jark Wu
Hi Mingge, Hao,

Thanks for your replies.

> PTF is actually the ideal approach for model functions, and we do have
the plans to use PTF for
all model functions (including prediction, evaluation etc..) once the PTF
is supported in FlinkSQL
confluent extension.

It sounds like PTF is the ideal way and the table function is a temporary
solution that will be dropped in the future.
I'm not sure whether we can implement it using PTF in Flink SQL, but we have
implemented window functions using PTF [1] and introduced a new window
function (called CUMULATE [2]) in Flink SQL based on it. I think it might
work to use PTF and implement the model function syntax like this:

SELECT * FROM TABLE(ML_PREDICT(
  TABLE my_table,
  my_model,
  col1,
  col2
));

Besides, did you consider following the way of AWS Redshift, which defines
the model function together with the model itself?
IIUC, a model is a black box that defines input and output parameters, and
can therefore be modeled as a function.
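
For illustration, Redshift's bring-your-own-model flow binds a function name
to the model at creation time, roughly like this (a sketch based on [3]; the
model, function, endpoint, and role names are placeholders):

```
CREATE MODEL remote_churn_model
FUNCTION predict_churn (INT, FLOAT)
RETURNS FLOAT
SAGEMAKER 'my-sagemaker-endpoint'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole';

-- the bound function is then called directly, without naming the model
SELECT predict_churn(age, monthly_spend) FROM customers;
```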


Best,
Jark

[1]:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/window-tvf/#session
[2]:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
[3]:
https://github.com/aws-samples/amazon-redshift-ml-getting-started/blob/main/use-cases/bring-your-own-model-remote-inference/README.md#create-model




On Wed, 13 Mar 2024 at 15:00, Hao Li  wrote:

> Hi Jark,
>
> Thanks for your questions. These are good questions!
>
> 1. The polymorphic table function I was referring to takes a table as
> input and outputs a table. So the syntax would be like
> ```
> SELECT * FROM ML_PREDICT('model', (SELECT * FROM my_table))
> ```
> As far as I know, this is not supported yet on Flink. So before it's
> supported, one option for the predict function is using table function
> which can output multiple columns
> ```
> SELECT * FROM my_table, LATERAL VIEW (ML_PREDICT('model', col1, col2))
> ```
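>
> For comparison, Flink's existing lateral join syntax for table functions
> would make this look roughly like the following (assuming ML_PREDICT were
> registered as a table function with named output columns; all names are
> illustrative):
> ```
> SELECT t.col1, p.label, p.score
> FROM my_table AS t,
>   LATERAL TABLE(ML_PREDICT('model', t.col1, t.col2)) AS p(label, score);
> ```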
>
> 2. Good question. Type inference is hard for the `ML_PREDICT` function
> because it takes a model name string as input. I can think of three ways of
> doing type inference for it.
>    1). Treat the `ML_PREDICT` function as something special: during SQL
> parsing or planning time, if it's encountered, we look up the model named
> by its first argument in the catalog. Then we can infer the input/output
> for the function.
>2). We can define a `model` keyword and use that in the predict function
> to indicate the argument refers to a model. So it's like `ML_PREDICT(model
> 'my_model', col1, col2)`
>3). We can create a special type of table function maybe called
> `ModelFunction` which can resolve the model type inference by special
> handling it during parsing or planning time.
> 1) is hacky, 2) isn't supported in Flink for functions, 3) might be a
> good option.
>
> 3. I sketched the `ML_PREDICT` function for inference. But there are
> limitations of the function mentioned in 1 and 2. So maybe we don't need to
> introduce them as built-in functions until polymorphic table functions are
> supported and we can properly deal with type inference.
> After that, defining a user-defined model function should also be
> straightforward.
>
> 4. For model types, do you mean 'remote', 'import', 'native' models or
> other things?
>
> 5. We could support popular providers such as 'azureml', 'vertexai',
> 'googleai' as long as we support the `ML_PREDICT` function. Users should be
> able to implement 3rd-party providers if they can implement a function
> handling the input/output for the provider.
>
> I think for the model functions, there are still dependencies or hacks we
> need to sort out as a built-in function. Maybe we can separate that as a
> follow up if we want to have it built-in and focus on the model syntax for
> this FLIP?
>
> Thanks,
> Hao
>
> On Tue, Mar 12, 2024 at 10:33 PM Jark Wu  wrote:
>
> > Hi Minge, Chris, Hao,
> >
> > Thanks for proposing this interesting idea. I think this is a nice step
> > towards
> > the AI world for Apache Flink. I don't know much about AI/ML, so I may
> have
> > some stupid questions.
> >
> > 1. Could you tell us more about why a polymorphic table function (PTF)
> doesn't
> > work, and do we have plans to use PTF as model functions?
> >
> > 2. What kind of object does the model map to in SQL? A relation or a data
> > type?
> > It looks like a data type because we use it as a parameter of the table
> > function.
> > If it is a data type, how does it cooperate with type inference [1]?

Re: [DISCUSS] FLIP-436: Introduce "SHOW CREATE CATALOG" Syntax

2024-03-13 Thread Jark Wu
Thank you Yubin,

+1 for the proposal. We have been lacking catalog related syntax to operate
catalogs.
It's a good chance to complete the syntax as we have introduced
CatalogStore.

From what I can see, some useful commands are still missing for catalogs,
such as ALTER CATALOG and DESCRIBE CATALOG. What do you think about including
these syntaxes in the FLIP as well?
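
For instance, the core statement of this FLIP could look like the following
sketch (the catalog name and properties are illustrative):

```
CREATE CATALOG cat1 WITH ('type' = 'generic_in_memory');

-- prints the DDL that re-creates the catalog
SHOW CREATE CATALOG cat1;
```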

Best,
Jark



On Thu, 14 Mar 2024 at 10:16, Yubin Li  wrote:

> Hi Jingsong, Feng, and Jeyhun
>
> Thanks for your support and feedback!
>
> > However, could we add a new method `getCatalogDescriptor()` to
> > CatalogManager instead of directly exposing CatalogStore?
>
> Good point. Besides the audit tracking issue, the proposed feature
> only requires the `getCatalogDescriptor()` function. Exposing components
> with excessive functionality would bring unnecessary risks, so I have made
> modifications in the FLIP doc [1]. Thanks, Feng :)
>
> > Showing the SQL parser implementation in the FLIP for the SQL syntax
> > might be a bit confusing. Also, the formal definition is missing for
> > this SQL clause.
>
> Thank Jeyhun for pointing it out :) I have updated the doc [1] .
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=296290756
>
> Best,
> Yubin
>
>
> On Thu, Mar 14, 2024 at 2:18 AM Jeyhun Karimov 
> wrote:
> >
> > Hi Yubin,
> >
> > Thanks for the proposal. +1 for it.
> > I have one comment:
> >
> > I would like to see the SQL syntax for the proposed statement.  Showing
> the
> > SQL parser implementation in the FLIP
> > for the SQL syntax might be a bit confusing. Also, the formal definition
> is
> > missing for this SQL clause.
> > Maybe something like [1] might be useful. WDYT?
> >
> > Regards,
> > Jeyhun
> >
> > [1]
> >
> https://github.com/apache/flink/blob/0da60ca1a4754f858cf7c52dd4f0c97ae0e1b0cb/docs/content/docs/dev/table/sql/show.md?plain=1#L620-L632
> >
> > On Wed, Mar 13, 2024 at 3:28 PM Feng Jin  wrote:
> >
> > > Hi Yubin
> > >
> > > Thank you for initiating this FLIP.
> > >
> > > I have just one minor question:
> > >
> > > I noticed that we added a new function `getCatalogStore` to expose
> > > CatalogStore, and it seems fine.
> > > However, could we add a new method `getCatalogDescriptor()` to
> > > CatalogManager instead of directly exposing CatalogStore?
> > > By only providing the `getCatalogDescriptor()` interface, it may be
> easier
> > > for us to implement audit tracking in CatalogManager in the future.
> WDYT ?
> > > Although we have only collected some modified events at the moment.[1]
> > >
> > >
> > > [1].
> > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-294%3A+Support+Customized+Catalog+Modification+Listener
> > >
> > > Best,
> > > Feng
> > >
> > > On Wed, Mar 13, 2024 at 5:31 PM Jingsong Li 
> > > wrote:
> > >
> > > > +1 for this.
> > > >
> > > > We are missing a series of catalog related syntaxes.
> > > > Especially after the introduction of catalog store. [1]
> > > >
> > > > [1]
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-295%3A+Support+lazy+initialization+of+catalogs+and+persistence+of+catalog+configurations
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Wed, Mar 13, 2024 at 5:09 PM Yubin Li  wrote:
> > > > >
> > > > > Hi devs,
> > > > >
> > > > > I'd like to start a discussion about FLIP-436: Introduce "SHOW
> CREATE
> > > > > CATALOG" Syntax [1].
> > > > >
> > > > > At present, the `SHOW CREATE TABLE` statement provides strong
> support
> > > for
> > > > > users to easily
> > > > > reuse created tables. However, despite the increasing importance
> of the
> > > > > `Catalog` in user's
> > > > > business, there is no similar statement for users to use.
> > > > >
> > > > > According to the online discussion in FLINK-24939 [2] with Jark Wu
> and
> > > > Feng
> > > > > Jin, since `CatalogStore`
> > > > > has been introduced in FLIP-295 [3], we could use this component to
> > > > > implement such a long-awaited
> > > > > feature, Please refer to the document [1] for implementation
> details.
> > > > >
> >

Re: [DISCUSS] FLIP-437: Support ML Models in Flink SQL

2024-03-12 Thread Jark Wu
Hi Minge, Chris, Hao,

Thanks for proposing this interesting idea. I think this is a nice step
towards
the AI world for Apache Flink. I don't know much about AI/ML, so I may have
some stupid questions.

1. Could you tell us more about why a polymorphic table function (PTF) doesn't
work, and do we have plans to use PTF as model functions?

2. What kind of object does the model map to in SQL? A relation or a data
type?
It looks like a data type because we use it as a parameter of the table
function.
If it is a data type, how does it cooperate with type inference[1]?

3. What built-in model functions will we support? How to define a
user-defined model function?

4. What built-in model types will we support? How to define a user-defined
model type?

5. Regarding the remote model, what providers will we support? Can users
implement
3rd-party providers other than OpenAI?

Best,
Jark

[1]:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#type-inference




On Wed, 13 Mar 2024 at 05:55, Hao Li  wrote:

> Hi, Dev
>
>
> Mingge, Chris and I would like to start a discussion about FLIP-437:
> Support ML Models in Flink SQL.
>
> This FLIP is proposing to support machine learning models in Flink SQL
> syntax so that users can CRUD models with Flink SQL and use models on Flink
> to do prediction with Flink data. The FLIP also proposes new model entities
> and changes to catalog interface to support model CRUD operations in
> catalog.
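>
> As a rough illustration of the direction (the exact DDL is specified in
> FLIP-437 [1]; the clauses below are a hypothetical sketch, not the final
> syntax):
>
> ```
> CREATE MODEL my_classifier
> INPUT (f1 INT, f2 STRING)
> OUTPUT (label STRING)
> WITH ('provider' = 'openai', 'task' = 'classification');
> ```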
>
> For more details, see FLIP-437 [1]. Looking forward to your feedback.
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-437%3A+Support+ML+Models+in+Flink+SQL
>
> Thanks,
> Minge, Chris & Hao
>


Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-12 Thread Jark Wu
I have created a JIRA issue and opened a pull request for this:
https://github.com/apache/flink-web/pull/725.

Best,
Jark

On Tue, 12 Mar 2024 at 16:56, Jark Wu  wrote:

> Thank you all for your feedback. If there are no other concerns or
> objections,
> I'm going to create a pull request to add the Special Thanks page.
>
> Further feedback and sponsors to be added are still welcome!
>
> Best,
> Jark
>
> On Mon, 11 Mar 2024 at 23:09, Maximilian Michels  wrote:
>
>> Hi Jark,
>>
>> Thanks for clarifying. At first sight, such a page indicated general
>> sponsorship. +1 for a Thank You page to list specific monetary
>> contributions to the project for resources which are actively used or
>> were actively used in the past.
>>
>> Cheers,
>> Max
>>
>> On Fri, Mar 8, 2024 at 11:55 AM Martijn Visser 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I'm +1 on it. As long as we follow the ASF rules on this, we can thank
>> > those that are/have made contributions.
>> >
>> > Best regards,
>> >
>> > Martijn
>> >
>> > On Wed, Mar 6, 2024 at 7:45 AM Jark Wu  wrote:
>> >
>> > > Hi Matthias,
>> > >
>> > > Thanks for your comments! Please see my reply inline.
>> > >
>> > > > What do we do if we have enough VMs? Do we still allow
>> > > companies to add more VMs to the pool even though it's not adding any
>> > > value?
>> > >
>> > > The ASF policy[1] makes it very clear: "Project Thanks pages are to
>> show
>> > > appreciation
>> > > for goods that the project truly needs, not just for goods that
>> someone
>> > > wants to donate."
>> > > Therefore, the community should reject new VMs if it is enough.
>> > >
>> > >
>> > > > The community lacks the openly accessible tools to monitor the VM
>> usage
>> > > independently
>> > > as far as I know (the Azure Pipelines project is owned by Ververica
>> right
>> > > now).
>> > >
>> > > The Azure pipeline account is sponsored by Ververica, and is managed
>> by the
>> > > community.
>> > > AFAIK, Chesnay and Robert both have admin permissions [2] to the Azure
>> > > pipeline project.
>> > > Others can contact the managers to get access to the environment.
>> > >
>> > > > I figured that there could be a chance for us to
>> > > rely on Apache-provided infrastructure entirely with our current
>> workload
>> > > when switching over from Azure Pipelines.
>> > >
>> > > That sounds great. We can return back the VMs and mark the donations
>> as
>> > > historical
>> > > on the Thank Page once the new GitHub Actions CI is ready.
>> > >
>> > > > I am fine with creating a Thank You page to acknowledge the
>> financial
>> > > contributions from Alibaba and Ververica in the past (since Apache
>> allows
>> > > historical donations) considering that the contributions of the two
>> > > companies go way back in time and are quite significant in my
>> opinion. I
>> > > suggest focusing on the past for now because of the option to migrate
>> to
>> > > Apache infrastructure midterm.
>> > >
>> > > Sorry, do you mean we only mention past donations for now?
>> > > IIUC, the new GitHub Actions might be ready after the end of v1.20,
>> which
>> > > will probably be in half a year.
>> > > I'm worried that if we say the sponsorship is ongoing until now (but
>> it's
>> > > not), it will confuse
>> > > people and disrespect the sponsor.
>> > >
>> > > Besides, I'm not sure whether the new GitHub Actions CI will replace
>> the
>> > > machines for running
>> > > flink-ci mirrors [3] and the flink benchmarks [4]. If not, I think
>> it's
>> > > inappropriate to say they are
>> > > historical donations.
>> > >
>> > > Furthermore, we are collecting all kinds of donations. I just noticed
>> that
>> > > AWS donated [5] service costs
>> > > for flink-connector-aws tests that hit real AWS services. This is an
>> > > ongoing donation and I think it's not
>> > > good to mark it as a historical donation. (Thanks for the donation,
>> AWS,
>> > > @Danny
>>

[jira] [Created] (FLINK-34654) Add "Special Thanks" Page on the Flink Website

2024-03-12 Thread Jark Wu (Jira)
Jark Wu created FLINK-34654:
---

 Summary: Add "Special Thanks" Page on the Flink Website
 Key: FLINK-34654
 URL: https://issues.apache.org/jira/browse/FLINK-34654
 Project: Flink
  Issue Type: New Feature
  Components: Project Website
Reporter: Jark Wu
Assignee: Jark Wu


This issue aims to add a "Special Thanks" page on the Flink website 
(https://flink.apache.org/) to honor and appreciate the companies and 
organizations that have sponsored machines or services for our project.

Discussion thread: 
https://lists.apache.org/thread/y5g0nd5t8h2ql4gq7d0kb9tkwv1wkm1j



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-12 Thread Jark Wu
Thank you all for your feedback. If there are no other concerns or
objections,
I'm going to create a pull request to add the Special Thanks page.

Further feedback and sponsors to be added are still welcome!

Best,
Jark

On Mon, 11 Mar 2024 at 23:09, Maximilian Michels  wrote:

> Hi Jark,
>
> Thanks for clarifying. At first sight, such a page indicated general
> sponsorship. +1 for a Thank You page to list specific monetary
> contributions to the project for resources which are actively used or
> were actively used in the past.
>
> Cheers,
> Max
>
> On Fri, Mar 8, 2024 at 11:55 AM Martijn Visser 
> wrote:
> >
> > Hi all,
> >
> > I'm +1 on it. As long as we follow the ASF rules on this, we can thank
> > those that are/have made contributions.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Wed, Mar 6, 2024 at 7:45 AM Jark Wu  wrote:
> >
> > > Hi Matthias,
> > >
> > > Thanks for your comments! Please see my reply inline.
> > >
> > > > What do we do if we have enough VMs? Do we still allow
> > > companies to add more VMs to the pool even though it's not adding any
> > > value?
> > >
> > > The ASF policy[1] makes it very clear: "Project Thanks pages are to
> show
> > > appreciation
> > > for goods that the project truly needs, not just for goods that someone
> > > wants to donate."
> > > Therefore, the community should reject new VMs if it is enough.
> > >
> > >
> > > > The community lacks the openly accessible tools to monitor the VM
> usage
> > > independently
> > > as far as I know (the Azure Pipelines project is owned by Ververica
> right
> > > now).
> > >
> > > The Azure pipeline account is sponsored by Ververica, and is managed
> by the
> > > community.
> > > AFAIK, Chesnay and Robert both have admin permissions [2] to the Azure
> > > pipeline project.
> > > Others can contact the managers to get access to the environment.
> > >
> > > > I figured that there could be a chance for us to
> > > rely on Apache-provided infrastructure entirely with our current
> workload
> > > when switching over from Azure Pipelines.
> > >
> > > That sounds great. We can return back the VMs and mark the donations as
> > > historical
> > > on the Thank Page once the new GitHub Actions CI is ready.
> > >
> > > > I am fine with creating a Thank You page to acknowledge the financial
> > > contributions from Alibaba and Ververica in the past (since Apache
> allows
> > > historical donations) considering that the contributions of the two
> > > companies go way back in time and are quite significant in my opinion.
> I
> > > suggest focusing on the past for now because of the option to migrate
> to
> > > Apache infrastructure midterm.
> > >
> > > Sorry, do you mean we only mention past donations for now?
> > > IIUC, the new GitHub Actions might be ready after the end of v1.20,
> which
> > > will probably be in half a year.
> > > I'm worried that if we say the sponsorship is ongoing until now (but
> it's
> > > not), it will confuse
> > > people and disrespect the sponsor.
> > >
> > > Besides, I'm not sure whether the new GitHub Actions CI will replace
> the
> > > machines for running
> > > flink-ci mirrors [3] and the flink benchmarks [4]. If not, I think it's
> > > inappropriate to say they are
> > > historical donations.
> > >
> > > Furthermore, we are collecting all kinds of donations. I just noticed
> that
> > > AWS donated [5] service costs
> > > for flink-connector-aws tests that hit real AWS services. This is an
> > > ongoing donation and I think it's not
> > > good to mark it as a historical donation. (Thanks for the donation,
> AWS,
> > > @Danny
> > > Cranmer  @HongTeoh!
> > > We should add it to the Thank Page!)
> > >
> > > Best,
> > > Jark
> > >
> > >
> > > [1]: https://www.apache.org/foundation/marks/linking#projectthanks
> > > [2]:
> > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/Continuous+Integration#ContinuousIntegration-Contacts
> > >
> > > [3]:
> > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/Continuous+Integration#ContinuousIntegration-Repositories
> > >
> > > [4]: https://lists.apache.org/thread

Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-06 Thread Jark Wu
tional VMs are not necessary (and with that, the need to have a Thank
> You page as well).
>
> But I acknowledge that Alibaba and Ververica would like to be recognized
> for their financial contributions to the community in the past. Therefore,
> I am fine with creating a Thank You page to acknowledge the financial
> contributions from Alibaba and Ververica in the past (since Apache allows
> historical donations) considering that the contributions of the two
> companies go way back in time and are quite significant in my opinion. I
> suggest focusing on the past for now because of the option to migrate to
> Apache infrastructure midterm.
>
> Best,
> Matthias
>
> [1] https://github.com/apache/flink/graphs/contributors
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-396%3A+Trial+to+test+GitHub+Actions+as+an+alternative+for+Flink%27s+current+Azure+CI+infrastructure
> [3]
>
> https://cwiki.apache.org/confluence/display/INFRA/Infra+Roundtable+2023-12-06%2C+17%3A00+UTC
>
> On Wed, Mar 6, 2024 at 7:06 AM tison  wrote:
>
> > > a rare way different than
> > > individuals (few individuals can donate such resources)
> >
> > Theoretically, if an individual donates so, we list list him/her as well.
> >
> > I've seen such donations in The Perl Foundation like [1]. But since a
> > PMC doesn't have a fundraising office, we may not accept raw money
> > anyway; it's already out of the thread :D
> >
> > Best,
> > tison.
> >
> > [1] https://news.perlfoundation.org/post/announcement_of_the_ian_hague
> >
> Yun Tang  wrote on Wed, 6 Mar 2024 at 13:58:
> > >
> > > Thanks for Jark's proposal, and I'm +1 for adding such a page.
> > >
> > > The CI infrastructure helps the Apache Flink project to run well. I
> > cannot imagine how insufficient CI machines would impact the development
> > progress, especially when the feature freeze date is close. And I believe
> > that most guys who contributed to the community would not know Alibaba
> and
> > Ververica had ever donated several machines to make the community work
> > smoothly for years.
> > >
> > >
> > > Best
> > > Yun Tang
> > > 
> > > From: Jark Wu 
> > > Sent: Wednesday, March 6, 2024 11:35
> > > To: dev@flink.apache.org 
> > > Subject: Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website
> > >
> > > Hi Max,
> > >
> > > Thank you for your input.
> > >
> > > According to ASF policy[1], the Thank Page is intended to thank third
> > > parties
> > > that provide physical resources like machines, services, and software
> > that
> > > the committers
> > >  or the project truly needs. I agree with Tison, such donation is
> > countable
> > > and that's why
> > > I started this discussion to collect the full list. The thank Page is
> not
> > > intended to thank working
> > > hours or contributions from individual volunteers which I think
> > > is recognized in other ways
> > > (e.g., credit of committer and PMC member).
> > >
> > > Best,
> > > Jark
> > >
> > > [1]: https://www.apache.org/foundation/marks/linking#projectthanks
> > >
> > > On Wed, 6 Mar 2024 at 01:14, tison  wrote:
> > >
> > > > Hi Max,
> > > >
> > > > Thanks for sharing your concerns :D
> > > >
> > > > I'd elaborate a bit on this topic with an example, that Apache
> Airflow
> > > > has a small section for its special sponsor who offers machines for
> CI
> > > > also [1].
> > > >
> > > > In my understanding, companies employ developers to invest time in
> the
> > > > development of Flink and that is large, vague, and hard to be fair to
> > > > list all of the companies.
> > > >
> > > > However, physical resources like CI machines are countable and they
> > > > help the sustainability of our project in a rare way different than
> > > > individuals (few individuals can donate such resources). We can
> > > > maintain such a section or page for those sponsors so that it also
> > > > decreases the friction when the company asks "what we can gain" (for
> > > > explicit credits, at least, and easy understanding).
> > > >
> > > > Any entity is welcome to add themselves as long as it's valid.
> > > >
> > > > For the fair part, I'm not an employee of either company listed on the
> > > > demo page and I don't feel uncomfortable.

Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-05 Thread Jark Wu
Hi Max,

Thank you for your input.

According to ASF policy[1], the Thank Page is intended to thank third
parties
that provide physical resources like machines, services, and software that
the committers
 or the project truly needs. I agree with Tison, such donation is countable
and that's why
I started this discussion to collect the full list. The thank Page is not
intended to thank working
hours or contributions from individual volunteers which I think
is recognized in other ways
(e.g., credit of committer and PMC member).

Best,
Jark

[1]: https://www.apache.org/foundation/marks/linking#projectthanks

On Wed, 6 Mar 2024 at 01:14, tison  wrote:

> Hi Max,
>
> Thanks for sharing your concerns :D
>
> I'd elaborate a bit on this topic with an example, that Apache Airflow
> has a small section for its special sponsor who offers machines for CI
> also [1].
>
> In my understanding, companies employ developers to invest time in the
> development of Flink and that is large, vague, and hard to be fair to
> list all of the companies.
>
> However, physical resources like CI machines are countable and they
> help the sustainability of our project in a rare way different than
> individuals (few individuals can donate such resources). We can
> maintain such a section or page for those sponsors so that it also
> decreases the friction when the company asks "what we can gain" (for
> explicit credits, at least, and easy understanding).
>
> Any entity is welcome to add themselves as long as it's valid.
>
> For the fair part, I'm not an employee of either company listed on the
> demo page and I don't feel uncomfortable. Those companies do invest a
> lot in our project and I'd regard it as a chance to encourage other
> companies to follow.
>
> Best,
> tison.
>
> [1] https://github.com/apache/airflow?tab=readme-ov-file#sponsors
>
Maximilian Michels  wrote on Wed, 6 Mar 2024 at 00:49:
> >
> > I'm a bit torn on this idea. On the one hand, it makes sense to thank
> > sponsors and entities who have supported Flink in the past. On the other
> > hand, this list is bound to be incomplete and maybe also biased, even
> > if not intended to be so. I think the power of open-source comes from
> > the unconditional donation of code and knowledge. Infrastructure costs
> > are a reality and donations in that area are meaningful, but they are
> > just one piece of the total sum which consists of many volunteers and
> > working hours. In my eyes, a Thank You page would have to display each
> > entity fairly which is going to be hard to achieve.
> >
> > -Max
> >
> > On Tue, Mar 5, 2024 at 2:30 PM Jingsong Li 
> wrote:
> > >
> > > +1 for setting up
> > >
> > > On Tue, Mar 5, 2024 at 5:39 PM Jing Ge 
> wrote:
> > > >
> > > > +1 and thanks for the proposal!
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > > On Tue, Mar 5, 2024 at 10:26 AM tison  wrote:
> > > >
> > > > > I like this idea, so +1 for setting up.
> > > > >
> > > > > For anyone who have the access, this is a related thread about
> > > > > project-wise sponsor in the foundation level [1].
> > > > >
> > > > > Best,
> > > > > tison.
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/2nv0x9gfk9lfnpb2315xgywyx84y97v6
> > > > >
> > > > > > Jark Wu  wrote on Tue, 5 Mar 2024 at 17:17:
> > > > > >
> > > > > > Sorry, I posted the wrong [7] link. The Flink benchmark ML link
> is:
> > > > > > https://lists.apache.org/thread/bkw6ozoflgltwfwmzjtgx522hyssfko6
> > > > > >
> > > > > >
> > > > > > On Tue, 5 Mar 2024 at 16:56, Jark Wu  wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I want to propose adding a "Special Thanks" page to our Apache
> Flink
> > > > > website [1]
> > > > > > >
> > > > > > > to honor and appreciate the
> > > > > > > companies and organizations that have sponsored
> > > > > > >
> > > > > > >
> > > > > > > machines or services for our project. The establishment of
> such a page
> > > > > serves as
> > > > > > >
> > > > > > >
> > > > > > > a pub

Re: [DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-05 Thread Jark Wu
Sorry, I posted the wrong [7] link. The Flink benchmark ML link is:
https://lists.apache.org/thread/bkw6ozoflgltwfwmzjtgx522hyssfko6


On Tue, 5 Mar 2024 at 16:56, Jark Wu  wrote:

> Hi all,
>
> I want to propose adding a "Special Thanks" page to our Apache Flink
> website [1] to honor and appreciate the companies and organizations that
> have sponsored machines or services for our project. The establishment of
> such a page serves as a public acknowledgment of our sponsors'
> contributions and simultaneously acts as a positive encouragement for
> other entities to consider supporting our project.
>
> Adding Per-Project Thanks pages is allowed by ASF policy [2], which says
> "PMCs may wish to provide recognition for third parties that provide
> software or services to the project's committers to further the goals of
> the project. These are typically called Per-Project Thanks pages." Many
> Apache projects have added such pages, for example, Apache HBase [3] and
> Apache Mina [4].
>
> To initiate this idea, I have drafted a preliminary page under the
> "About" menu on the Flink website to specifically thank Alibaba and
> Ververica, by following the ASF guidelines and the Apache Mina project.
>
> page image:
> https://github.com/apache/flink/assets/5378924/e51aaffe-565e-46d1-90af-3900904afcc0
>
> The below companies are on the thanks list for their donations to the
> Flink testing infrastructure:
>
> - Alibaba donated 8 machines (32vCPU, 64GB) for running Flink CI builds [5].
> - Ververica donated 2 machines for hosting flink-ci repositories [6] and
>   running Flink benchmarks [7].
>
> I may have missed some other donations or companies; please add them if
> you know of any.
>
> Looking forward to your feedback about this proposal!
>
> Best,
> Jark
>
> [1]: https://flink.apache.org/
> [2]: https://www.apache.org/foundation/marks/linking#projectthanks
> [3]: https://hbase.apache.org/sponsors.html
> [4]: https://mina.apache.org/special-thanks.html
> [5]: https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-AvailableCustomBuildMachines
> [6]: https://cwiki.apache.org/confluence/display/FLINK/Continuous+Integration
> [7]: https://lists.apache.org/thread.html/41a68c775753a7841896690c75438e0a497634102e676db880f30225@%3Cdev.flink.apache.org%3E
>


[DISCUSS] Add "Special Thanks" Page on the Flink Website

2024-03-05 Thread Jark Wu
Hi all,

I want to propose adding a "Special Thanks" page to our Apache Flink
website [1] to honor and appreciate the companies and organizations that
have sponsored machines or services for our project. The establishment of
such a page serves as a public acknowledgment of our sponsors'
contributions and simultaneously acts as a positive encouragement for
other entities to consider supporting our project.

Adding Per-Project Thanks pages is allowed by ASF policy [2], which says
"PMCs may wish to provide recognition for third parties that provide
software or services to the project's committers to further the goals of
the project. These are typically called Per-Project Thanks pages." Many
Apache projects have added such pages, for example, Apache HBase [3] and
Apache Mina [4].

To initiate this idea, I have drafted a preliminary page under the
"About" menu on the Flink website to specifically thank Alibaba and
Ververica, by following the ASF guidelines and the Apache Mina project.

page image:
https://github.com/apache/flink/assets/5378924/e51aaffe-565e-46d1-90af-3900904afcc0

The below companies are on the thanks list for their donations to the
Flink testing infrastructure:

- Alibaba donated 8 machines (32vCPU, 64GB) for running Flink CI builds [5].
- Ververica donated 2 machines for hosting flink-ci repositories [6] and
  running Flink benchmarks [7].

I may have missed some other donations or companies; please add them if
you know of any.

Looking forward to your feedback about this proposal!

Best,

Jark

[1]: https://flink.apache.org/

[2]: https://www.apache.org/foundation/marks/linking#projectthanks

[3]: https://hbase.apache.org/sponsors.html

[4]: https://mina.apache.org/special-thanks.html

[5]:
https://cwiki.apache.org/confluence/display/FLINK/Azure+Pipelines#AzurePipelines-AvailableCustomBuildMachines

[6]:
https://cwiki.apache.org/confluence/display/FLINK/Continuous+Integration

[7]:
https://lists.apache.org/thread.html/41a68c775753a7841896690c75438e0a497634102e676db880f30225@%3Cdev.flink.apache.org%3E


Re: Re: [VOTE] FLIP-377: Support fine-grained configuration to control filter push down for Table/SQL Sources

2024-01-18 Thread Jark Wu
+1 (binding)

Best,
Jark

On Tue, 16 Jan 2024 at 18:01, Xuyang  wrote:

> +1 (non-binding)
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> On 2024-01-16 17:52:38, "Leonard Xu"  wrote:
> >+1 (binding)
> >
> >Best,
> >Leonard
> >
> >> On 16 Jan 2024 at 17:40, Hang Ruan  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> Best,
> >> Hang
> >>
> >> Jiabao Sun  于2024年1月9日周二 19:39写道:
> >>
> >>> Hi Devs,
> >>>
> >>> I'd like to start a vote on FLIP-377: Support fine-grained
> configuration
> >>> to control filter push down for Table/SQL Sources[1]
> >>> which has been discussed in this thread[2].
> >>>
> >>> The vote will be open for at least 72 hours unless there is an
> objection
> >>> or not enough votes.
> >>>
> >>> [1]
> >>>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768
> >>> [2] https://lists.apache.org/thread/nvxx8sp9jm009yywm075hoffr632tm7j
> >>>
> >>> Best,
> >>> Jiabao
>


Re: [VOTE] Accept Flink CDC into Apache Flink

2024-01-08 Thread Jark Wu
+1 (binding)

Best,
Jark

On Tue, 9 Jan 2024 at 15:31, Benchao Li  wrote:

> +1 (non-binding)
>
> Feng Wang  wrote on Tue, 9 Jan 2024 at 15:29:
> >
> > +1 non-binding
> > Regards,
> > Feng
> >
> > On Tue, Jan 9, 2024 at 3:05 PM Leonard Xu  wrote:
> >
> > > Hello all,
> > >
> > > This is the official vote whether to accept the Flink CDC code
> contribution
> > >  to Apache Flink.
> > >
> > > The current Flink CDC code, documentation, and website can be
> > > found here:
> > > code: https://github.com/ververica/flink-cdc-connectors <
> > > https://github.com/ververica/flink-cdc-connectors>
> > > docs: https://ververica.github.io/flink-cdc-connectors/ <
> > > https://ververica.github.io/flink-cdc-connectors/>
> > >
> > > This vote should capture whether the Apache Flink community is
> interested
> > > in accepting, maintaining, and evolving Flink CDC.
> > >
> > > Regarding my original proposal[1] in the dev mailing list, I firmly
> believe
> > > that this initiative aligns perfectly with Flink. For the Flink
> community,
> > > it represents an opportunity to bolster Flink's competitive edge in
> > > streaming
> > > data integration, fostering the robust growth and prosperity of the
> Apache
> > > Flink
> > > ecosystem. For the Flink CDC project, becoming a sub-project of Apache
> > > Flink
> > > means becoming an integral part of a neutral open-source community,
> > > capable of
> > > attracting a more diverse pool of contributors.
> > >
> > > All Flink CDC maintainers are dedicated to continuously contributing to
> > > achieve
> > > seamless integration with Flink. Additionally, PMC members like Jark,
> > > Qingsheng,
> > > and I are willing to facilitate the expansion of contributors and
> > > committers to
> > > effectively maintain this new sub-project.
> > >
> > > This is a "Adoption of a new Codebase" vote as per the Flink bylaws
> [2].
> > > Only PMC votes are binding. The vote will be open at least 7 days
> > > (excluding weekends), meaning until Thursday January 18 12:00 UTC, or
> > > until we
> > > achieve the 2/3rd majority. We will follow the instructions in the
> Flink
> > > Bylaws
> > > in the case of insufficient active binding voters:
> > >
> > > > 1. Wait until the minimum length of the voting passes.
> > > > 2. Publicly reach out via personal email to the remaining binding
> voters
> > > in the
> > > voting mail thread for at least 2 attempts with at least 7 days between
> > > two attempts.
> > > > 3. If the binding voter being contacted still failed to respond after
> > > all the attempts,
> > > the binding voter will be considered as inactive for the purpose of
> this
> > > particular voting.
> > >
> > > Welcome voting !
> > >
> > > Best,
> > > Leonard
> > > [1] https://lists.apache.org/thread/o7klnbsotmmql999bnwmdgo56b6kxx9l
> > > [2]
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
>
>
>
> --
>
> Best,
> Benchao Li
>


Re: [VOTE] Release 1.18.1, release candidate #2

2024-01-08 Thread Jark Wu
Thanks Jing for driving this.

+1 (binding)

- Build and compile the source code locally: *OK*
- Verified signatures and hashes: *OK*
- Checked no missing artifacts in the staging area: *OK*
- Reviewed the website release PR: *OK*
- Went through the quick start: *OK*
  * Started a cluster and ran the examples
  * Verified web ui and log output, nothing unexpected

Best,
Jark

On Thu, 28 Dec 2023 at 20:59, Yun Tang  wrote:

> Thanks Jing for driving this release.
>
> +1 (non-binding)
>
>
>   *
> Download artifacts and verify the signatures.
>   *
> Verified the web PR
>   *
> Verified the number of Python packages is 11
>   *
> Started a local cluster and verified FLIP-291 to see the rescale results.
>   *
> Verified the jar packages were built with JDK8
>
> Best
> Yun Tang
>
>
> 
> From: Rui Fan <1996fan...@gmail.com>
> Sent: Thursday, December 28, 2023 10:54
> To: dev@flink.apache.org 
> Subject: Re: [VOTE] Release 1.18.1, release candidate #2
>
> Thanks Jing for driving this release!
>
> +1(non-binding)
>
> - Downloaded artifacts
> - Verified signatures and sha512
> - The source archives do not contain any binaries
> - Verified web PR
> - Build the source with Maven 3 and java8 (Checked the license as well)
- bin/start-cluster.sh with java8, it works fine and there are no unexpected logs
- Ran the demo, it's fine: bin/flink run examples/streaming/StateMachineExample.jar
>
> Best,
> Rui
>
> On Wed, Dec 27, 2023 at 8:45 PM Martijn Visser 
> wrote:
>
> > Hi Jing,
> >
> > Thanks for driving this.
> >
> > +1 (binding)
> >
> > - Validated hashes
> > - Verified signature
> > - Verified that no binaries exist in the source archive
> > - Build the source with Maven via mvn clean install
> > -Pcheck-convergence -Dflink.version=1.18.1
> > - Verified licenses
> > - Verified web PR
> > - Started a cluster and the Flink SQL client, successfully read and
> > wrote with the Kafka connector to Confluent Cloud with AVRO and Schema
> > Registry enabled
> > - Started a cluster and submitted a job that checkpoints to GCS without
> > problems
> >
> > Best regards,
> >
> > Martijn
> >
> > On Thu, Dec 21, 2023 at 4:55 AM gongzhongqiang
> >  wrote:
> > >
> > > Thanks Jing Ge for driving this release.
> > >
> > > +1 (non-binding), I have checked:
> > > [✓] The checksums and signatures are validated
> > > [✓] The tag checked is fine
> > > [✓] Built from source is passed
> > > [✓] The flink-web PR is reviewed and checked
> > >
> > >
> > > Best,
> > > Zhongqiang Gong
> >
>


Re: [DISCUSS] Release Flink 1.18.1

2023-12-11 Thread Jark Wu
Thanks Jing for driving 1.18.1.
+1 for this.

Best,
Jark

On Mon, 11 Dec 2023 at 16:59, Hong Liang  wrote:

> +1. Thanks Jing for driving this.
>
> Hong
>
> On Mon, Dec 11, 2023 at 2:27 AM Yun Tang  wrote:
>
> > Thanks Jing for driving 1.18.1 release, +1 for this.
> >
> >
> > Best
> > Yun Tang
> > 
> > From: Rui Fan <1996fan...@gmail.com>
> > Sent: Saturday, December 9, 2023 21:46
> > To: dev@flink.apache.org 
> > Subject: Re: [DISCUSS] Release Flink 1.18.1
> >
> > Thanks Jing for driving this release, +1
> >
> > Best,
> > Rui
> >
> > On Sat, Dec 9, 2023 at 7:33 AM Leonard Xu  wrote:
> >
> > > Thanks Jing for driving this release, +1
> > >
> > > Best,
> > > Leonard
> > >
> > > > On 9 Dec 2023 at 01:23, Danny Cranmer  wrote:
> > > >
> > > > +1
> > > >
> > > > Thanks for driving this
> > > >
> > > > On Fri, 8 Dec 2023, 12:05 Timo Walther,  wrote:
> > > >
> > > >> Thanks for taking care of this Jing.
> > > >>
> > > >> +1 to release 1.18.1 for this.
> > > >>
> > > >> Cheers,
> > > >> Timo
> > > >>
> > > >>
> > > >> On 08.12.23 10:00, Benchao Li wrote:
> > > >>> I've merged FLINK-33313 to release-1.18 branch.
> > > >>>
> > >  On Fri, 8 Dec 2023 at 16:56, Péter Váry wrote:
> > > 
> > >  Hi Jing,
> > >  Thanks for taking care of this!
> > >  +1 (non-binding)
> > >  Peter
> > > 
> > >  Sergey Nuyanzin  ezt írta (időpont: 2023.
> dec.
> > > >> 8., P,
> > >  9:36):
> > > 
> > > > Thanks Jing for driving it
> > > > +1
> > > >
> > > > also +1 to include FLINK-33313 mentioned by Benchao Li
> > > >
> > > > On Fri, Dec 8, 2023 at 9:17 AM Benchao Li 
> > > >> wrote:
> > > >
> > > >> Thanks Jing for driving 1.18.1 releasing.
> > > >>
> > > >> I would like to include FLINK-33313[1] in 1.18.1, it's just a
> > > bugfix,
> > > >> not a blocker, but it's already merged into master, I plan to
> > merge
> > > it
> > > >> to 1.18/1.17 branches today after the CI passes.
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/FLINK-33313
> > > >>
> > > > >> On Fri, 8 Dec 2023 at 16:06, Jing Ge wrote:
> > > >>>
> > > >>> Hi all,
> > > >>>
> > > >>> I would like to discuss creating a new 1.18 patch release
> > (1.18.1).
> > > >> The
> > > >>> last 1.18 release is nearly two months old, and since then, 37
> > > >> tickets
> > > >> have
> > > >>> been closed [1], of which 6 are blocker/critical [2].  Some of
> > them
> > > >> are
> > > >>> quite important, such as FLINK-33598 [3].
> > > >>>
> > > >>> The most urgent and important one is FLINK-33523 [4], and according to
> > > >>> the discussion thread [5] on the ML, 1.18.1 should/must be released
> > asap
> > > > after
> > > >>> the breaking change commit has been reverted.
> > > >>>
> > > >>> I am not aware of any other unresolved blockers and there are
> no
> > > >> in-progress
> > > >>> tickets [6].
> > > >>> Please let me know if there are any issues you'd like to be
> > > included
> > > >> in
> > > >>> this release but still not merged.
> > > >>>
> > > >>> If the community agrees to create this new patch release, I
> could
> > > >>> volunteer as the release manager.
> > > >>>
> > > >>> Best regards,
> > > >>> Jing
> > > >>>
> > > >>> [1]
> > > >>>
> > > >>
> > > >
> > > >>
> > >
> >
> https://issues.apache.org/jira/browse/FLINK-33567?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.18.1%20%20and%20resolution%20%20!%3D%20%20Unresolved%20order%20by%20priority%20DESC
> > > >>> [2]
> > > >>>
> > > >>
> > > >
> > > >>
> > >
> >
> https://issues.apache.org/jira/browse/FLINK-33693?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.18.1%20and%20resolution%20%20!%3D%20Unresolved%20%20and%20priority%20in%20(Blocker%2C%20Critical)%20ORDER%20by%20priority%20%20DESC
> > > >>> [3] https://issues.apache.org/jira/browse/FLINK-33598
> > > >>> [4] https://issues.apache.org/jira/browse/FLINK-33523
> > > >>> [5]
> > > https://lists.apache.org/thread/m4c879y8mb7hbn2kkjh9h3d8g1jphh3j
> > > >>> [6]
> > > https://issues.apache.org/jira/projects/FLINK/versions/12353640
> > > >>> Thanks,
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> Best,
> > > >> Benchao Li
> > > >>
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Sergey
> > > >
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > > >>
> > >
> > >
> >
>


Re: [PROPOSAL] Contribute Flink CDC Connectors project to Apache Flink

2023-12-06 Thread Jark Wu
+1 for adding this to Apache Flink!

I think this can further extend the ability of Apache Flink, and a lot of
users would be
interested in trying this out.

Best,
Jark

On Thu, 7 Dec 2023 at 12:06, Samrat Deb  wrote:

> That's really cool :)
> +1 for the great addition
>
> Bests,
> Samrat
>
> On Thu, 7 Dec 2023 at 9:20 AM, Jingsong Li  wrote:
>
>> Wow, Cool, Nice
>>
>> CDC is playing an increasingly important role.
>>
>> +1
>>
>> Best,
>> Jingsong
>>
>> On Thu, Dec 7, 2023 at 11:25 AM Leonard Xu  wrote:
>> >
>> > Dear Flink devs,
>> >
>> > As you may have heard, we at Alibaba (Ververica) are planning to donate
>> CDC Connectors for the Apache Flink project[1] to the Apache Flink
>> community.
>> >
>> > CDC Connectors for Apache Flink comprise a collection of source
>> connectors designed specifically for Apache Flink. These connectors[2]
>> enable the ingestion of changes from various databases using Change Data
>> Capture (CDC), most of these CDC connectors are powered by Debezium[3].
>> They support both the DataStream API and the Table/SQL API, facilitating
>> the reading of database snapshots and continuous reading of transaction
>> logs with exactly-once processing, even in the event of failures.
>> >
>> >
>> > Additionally, in the latest version 3.0, we have introduced many
>> long-awaited features. Starting from CDC version 3.0, we've built a
>> Streaming ELT Framework available for streaming data integration. This
>> framework allows users to write their data synchronization logic in a
>> simple YAML file, which will automatically be translated into a Flink
>> DataStream job. It emphasizes optimizing the task submission process and
>> offers advanced functionalities such as whole database synchronization,
>> merging sharded tables, and schema evolution[4].
>> >
>> >
>> > I believe this initiative is a perfect match for both sides. For the
>> Flink community, it presents an opportunity to enhance Flink's competitive
>> advantage in streaming data integration, promoting the healthy growth and
>> prosperity of the Apache Flink ecosystem. For the CDC Connectors project,
>> becoming a sub-project of Apache Flink means being part of a neutral
>> open-source community, which can attract a more diverse pool of
>> contributors.
>> >
>> > Please note that the aforementioned points represent only some of our
>> motivations and vision for this donation. Specific future operations need
>> to be further discussed in this thread. For example, the sub-project name
>> after the donation; we hope to name it Flink-CDC, aiming at streaming data
>> integration through Apache Flink, following the naming convention of
>> Flink-ML. And this project is managed by a total of 8 maintainers,
>> including 3 Flink PMC members and 1 Flink Committer. The remaining 4
>> maintainers are also highly active contributors to the Flink community;
>> donating this project to the Flink community implies that their permissions
>> might be reduced. Therefore, we may need to bring up this topic for further
>> discussion within the Flink PMC. Additionally, we need to discuss how to
>> migrate existing users and documents. We have a user group of nearly 10,000
>> people and a multi-version documentation site that need to migrate. We also need
>> to plan for the migration of CI/CD processes and other specifics.
>> >
>> >
>> > While there are many intricate details that require implementation, we
>> are committed to progressing and finalizing this donation process.
>> >
>> >
>> > Besides being Flink’s most active ecosystem project (as evaluated by
>> GitHub metrics), it also boasts a significant user base. However, I believe
>> it's essential to commence discussions on future operations only after the
>> community reaches a consensus on whether they desire this donation.
>> >
>> >
>> > Really looking forward to hearing what you think!
>> >
>> >
>> > Best,
>> > Leonard (on behalf of the Flink CDC Connectors project maintainers)
>> >
>> > [1] https://github.com/ververica/flink-cdc-connectors
>> > [2]
>> https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-connectors.html
>> > [3] https://debezium.io
>> > [4]
>> https://ververica.github.io/flink-cdc-connectors/master/content/overview/cdc-pipeline.html
>>
>


[jira] [Created] (FLINK-33600) Print cost time for batch queries in SQL Client

2023-11-20 Thread Jark Wu (Jira)
Jark Wu created FLINK-33600:
---

 Summary: Print cost time for batch queries in SQL Client
 Key: FLINK-33600
 URL: https://issues.apache.org/jira/browse/FLINK-33600
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Client
Reporter: Jark Wu
 Fix For: 1.19.0


Currently, there is no cost-time information when executing batch queries in
the SQL CLI, but this would be very helpful in OLAP/ad-hoc scenarios.

For example: 
{code}
Flink SQL> select * from (values ('abc', 123));
+--------+--------+
| EXPR$0 | EXPR$1 |
+--------+--------+
|    abc |    123 |
+--------+--------+
1 row in set  (0.22 seconds)
{code}
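
A rough sketch of the idea (illustrative only — this is not the actual SQL
Client implementation, just one way the elapsed time could be measured around
statement execution):

{code}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.TableResult;
import org.apache.flink.types.Row;
import org.apache.flink.util.CloseableIterator;

public class CostTimeSketch {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        long start = System.nanoTime();
        TableResult result = tEnv.executeSql("SELECT * FROM (VALUES ('abc', 123))");
        long rowCount = 0;
        try (CloseableIterator<Row> it = result.collect()) {
            // consume the full result before stopping the timer
            while (it.hasNext()) {
                it.next();
                rowCount++;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d row in set  (%.2f seconds)%n", rowCount, seconds);
    }
}
{code}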



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] FLIP-378: Support Avro timestamp with local timezone

2023-11-15 Thread Jark Wu
+1 (binding)

Best,
Jark

On Thu, 16 Nov 2023 at 12:41, Leonard Xu  wrote:

> +1(binding)
>
> Best,
> Leonard
>
> > 2023年11月16日 下午12:13,Mingliang Liu  写道:
> >
> > +1 (non-binding)
> >
> > On Wed, Nov 15, 2023 at 3:38 PM Peter Huang 
> > wrote:
> >
> >> Hi Devs,
> >>
> >> I'd like to start a vote on FLIP-378: Support Avro timestamp with local
> >> timezone which has been discussed in this thread [2].
> >>
> >> The vote will be open for at least 72 hours unless there is an
> objection or
> >> not enough votes.
> >>
> >>
> >> [1] https://cwiki.apache.org/confluence/x/Hgt1E
> >> [2] https://lists.apache.org/thread/fyv16p40z9go0dhosc4cr2ywqclyqqq5
> >>
> >>
> >> Best Regards
> >> Peter Huang
> >>
>
>


Re: [DISCUSS] FLIP-378: Support Avro timestamp with local timezone

2023-11-13 Thread Jark Wu
+1

I think we can mark the temporary config option as deprecated when we
introduce it.
So we can remove it after two minor releases (1.19, 1.20), i.e., drop it in
the 2.0 release.

A minor comment about the config option: I would suggest using
"avro.timestamp-mapping.legacy"
instead of "avro.timestamp_mapping.legacy", because the Flink community
prefers dashes over underscores [1].
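
For example, with the dash-style name, a DDL opting in to the new mapping
might look like this (a sketch only — the option does not exist yet, the
table options are illustrative, and `tableEnv` is an existing
TableEnvironment):

    tableEnv.executeSql(
        "CREATE TABLE avro_events (" +
        "  ts TIMESTAMP_LTZ(3)" +
        ") WITH (" +
        "  'connector' = 'kafka'," +
        "  'format' = 'avro'," +
        "  'avro.timestamp-mapping.legacy' = 'false'" + // opt in to the new mapping
        ")");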

Best,
Jark

[1]:
https://flink.apache.org/how-to-contribute/code-style-and-quality-components/#configuration-changes

On Sat, 4 Nov 2023 at 21:40, Gyula Fóra  wrote:

> +1
>
> Gyula
>
> On Thu, Nov 2, 2023 at 6:18 AM Martijn Visser 
> wrote:
>
> > +1
> >
> > On Thu, Nov 2, 2023 at 12:44 PM Leonard Xu  wrote:
> > >
> > >
> > > > Thanks @Leonard Xu . Two minor versions
> are
> > definitely needed for flipping the configs.
> > >
> > > Sorry, Peter. I thought the next minor versions are 1.19、2.0, but
> > actually it should be 1.19、1.20、2.0 from current community version plan
> > IIUC, so remove the config in 2.0 should be okay if the 1.20 version
> exists
> > .
> > >
> > > Best,
> > > Leonard
> > >
> > >
> > > >
> > > > On Mon, Oct 30, 2023 at 8:55 PM Leonard Xu  wrote:
> > > > Thanks @Peter for driving this FLIP
> > > >
> > > > +1 from my side, the timestamp semantics mapping looks good to me.
> > > >
> > > > >  In the end, the legacy behavior will be dropped in
> > > > > Flink 2.0
> > > > > I don’t think we can drop this option, which is introduced in 1.19, and
> > drop it in 2.0; the API removal requires at least two minor versions.
> > > >
> > > >
> > > > Best,
> > > > Leonard
> > > >
> > > > > On Tue, 31 Oct 2023 at 11:18 AM, Peter Huang wrote:
> > > > >
> > > > > Hi Devs,
> > > > >
> > > > > Currently, Flink Avro Format doesn't support the Avro timestamp
> > (millis/micros)
> > > > > with local timezone type.
> > > > > Although the Avro timestamp (millis/micros) type is supported and
> is
> > mapped
> > > > > to flink timestamp without timezone,
> > > > > it is not compliant with the semantics defined in Consistent timestamp
> > types in
> > > > > Hadoop SQL engines
> > > > > <https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit#heading=h.n699ftkvhjlo>
> > > > > .
> > > > >
> > > > > I propose to support Avro timestamps in compliance with the
> > mapping
> > > > > semantics [1] by using a configuration flag.
> > > > > To keep backward compatibility, the current mapping is kept as the
> > default behavior.
> > > > > Users can explicitly turn on the new mapping
> > > > > by setting the flag to false. In the end, the legacy behavior will be
> > dropped in
> > > > > Flink 2.0
> > > > >
> > > > > Looking forward to your feedback.
> > > > >
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-378%3A+Support+Avro+timestamp+with+local+timezone
> > > > >
> > > > >
> > > > > Best Regards
> > > > >
> > > > > Peter Huang
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-30 Thread Jark Wu
Hi Timo,

Thank you for the update. The FLIP looks good to me now.
I only have one more question.

What does Flink check, and when does it throw exceptions, for bucketing?
For example, do we check the interfaces when executing CREATE/ALTER
DDL, and when the table is used as a source?

Best,
Jark

On Tue, 31 Oct 2023 at 00:25, Timo Walther  wrote:

> Hi Jing,
>
>  > Have you considered using BUCKET BY directly?
>
> Which vendor uses this syntax? Most vendors that I checked call this
> concept "distribution".
>
> In any case, the "BY" is optional, so certain DDL statements would
> declare it like "BUCKET INTO 6 BUCKETS"? And following the PARTITIONED,
> we should use the passive voice.
>
>  > Did you mean users can use their own algorithm? How to do it?
>
> "own algorithm" only refers to deciding between a list of partitioning
> strategies (namely hash and range partitioning) if the connector offers
> more than one.
>
> Regards,
> Timo
>
>
> On 30.10.23 12:39, Jing Ge wrote:
> > Hi Timo,
> >
> > The FLIP looks great! Thanks for bringing it to our attention! In order
> to
> > make sure we are on the same page, I would ask some questions:
> >
> > 1. DISTRIBUTED BY reminds me of DISTRIBUTE BY from Hive like Benchao
> mentioned,
> > which is used to distribute rows among reducers, i.e. focusing on the
> > shuffle during the computation. The FLIP is focusing more on storage, if
> I
> > am not mistaken. Have you considered using BUCKET BY directly?
> >
> > 2. According to the FLIP: " CREATE TABLE MyTable (uid BIGINT, name
> STRING)
> > DISTRIBUTED BY HASH(uid) INTO 6 BUCKETS
> >
> > - For advanced users, the algorithm can be defined explicitly.
> > - Currently, either HASH() or RANGE().
> >
> > "
> > Did you mean users can use their own algorithm? How to do it?
> >
> > Best regards,
> > Jing
> >
> > On Mon, Oct 30, 2023 at 11:13 AM Timo Walther 
> wrote:
> >
> >> Let me reply to the feedback from Yunfan:
> >>
> >>   > Distribute by in DML is also supported by Hive
> >>
> >> I see DISTRIBUTED BY and DISTRIBUTE BY as two separate discussions. This
> >> discussion is about DDL. For DDL, we have more freedom as every vendor
> >> has custom syntax for CREATE TABLE clauses. Furthermore, this is tightly
> >> connector to the connector implementation, not the engine. However, for
> >> DML we need to watch out for standard compliance and introduce changes
> >> with high caution.
> >>
> >> How a LookupTableSource interprets the DISTRIBUTED BY is
> >> connector-dependent in my opinion. In general this FLIP is a sink
> >> ability, but we could have a follow-up FLIP that helps in distributing the load
> >> of lookup joins.
> >>
> >>   > to avoid data skew problem
> >>
> >> I understand the use case and that it is important to solve it
> >> eventually. A solution might be to introduce helper Polymorphic
> >> Table Functions [1] in the future instead of new syntax.
> >>
> >> [1]
> >>
> >>
> https://www.ifis.uni-luebeck.de/~moeller/Lectures/WS-19-20/NDBDM/12-Literature/Michels-SQL-2016.pdf
> >>
> >>
> >> Let me reply to the feedback from Benchao:
> >>
> >>   > Do you think it's useful to add some extensibility for the hash
> >> strategy
> >>
> >> The hash strategy is fully determined by the connector, not the Flink
> >> SQL engine. We are not using Flink's hash strategy in any way. If the
> >> hash strategy for the regular Flink file system connector should be
> >> changed, this should be expressed via a config option. Otherwise, we should
> >> offer a dedicated `hive-filesystem` or `spark-filesystem` connector.
> >>
> >> Regards,
> >> Timo
> >>
> >>
> >> On 30.10.23 10:44, Timo Walther wrote:
> >>> Hi Jark,
> >>>
> >>> my intention was to avoid overly complex syntax in the first version. In
> >>> the past years, we could enable use cases also without this clause, so
> >>> we should be careful with overloading it with too much functionality in the
> >>> first version. We can still iterate on it later; the interfaces are
> >>> flexible enough to support more in the future.
> >>>
> >>> I agree that maybe an explicit HASH and RANGE doesn't harm. Also making
> >>> the bucket number optional.
> >>>
> >>> I updated the FLIP accordingly. Now the S

Re: [DISCUSS] Planning Flink 1.19

2023-10-29 Thread Jark Wu
+1 for the proposed release managers, and the feature freeze on Jan 26
sounds good to me.

Best,
Jark

On Mon, 30 Oct 2023 at 12:15, Xintong Song  wrote:

> Thanks for kicking this off.
>
> +1 for the proposed release managers (Lincoln, Yun, Jing and Martijn) and
> targeting date (feature freeze: Jan 26).
>
> I'd like to bring up that it is likely many efforts in the 1.19 release
> cycle are also related to the 2.0 release. I think it would be better if
> the 1.19 and 2.0 RMs can track these efforts jointly rather than
> separately. It would be appreciated if you can also involve us for the 1.19
> discussions & syncs.
>
> Best,
>
> Xintong
>
>
>
> On Sun, Oct 29, 2023 at 5:30 PM Leonard Xu  wrote:
>
> > Thanks everyone for volunteering. +1 for Lincoln, Yun Tang, Jing and
> > Martijn as release managers from my side.
> >
> > The combination of experienced release managers and new release managers
> > can help the community cultivate more committers with version management
> > experience.
> >
> > Best,
> > Leonard
> >
> > > On Fri, 27 Oct 2023 at 5:22 PM, Matthias Pohl wrote:
> > >
> > > +1 from my side for Lincoln, Yun Tang, Jing and Martijn as release
> > managers.
> > > Thanks everyone for volunteering.
> > >
> > > I tried to collect the different tasks that are part of release
> > management
> > > in [1]. It might help to identify responsibilities. Feel free to have a
> > > look and/or update it. Ideally, it will help others to decide whether
> > they
> > > feel ready to contribute to the community as release managers in future
> > > releases.
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Release+Management
> > >
> > >
> > > On Thu, Oct 26, 2023 at 9:15 PM Martijn Visser
> > 
> > > wrote:
> > >
> > >> Hi Lincoln and Yun,
> > >>
> > >> Happy to jump on board as release manager too :)
> > >>
> > >> Best regards,
> > >>
> > >> Martijn
> > >>
> > >> On Thu, 26 Oct 2023 at 20:50, Jing Ge wrote:
> > >>
> > >>> Hi Lincoln,
> > >>>
> > >>> Thanks for kicking off 1.19! I got a lot of experience as a release
> > >> manager
> > >>> for the 1.18 release. I would like to join you and participate in the
> > >> 1.19
> > >>> release cycle.
> > >>>
> > >>> Best regards,
> > >>> Jing
> > >>>
> > >>> On Thu, Oct 26, 2023 at 6:27 PM Lincoln Lee 
> > >>> wrote:
> > >>>
> >  Hi everyone,
> > 
> >  With the release announcement of Flink 1.18, it’s a good time to
> kick
> > >> off
> >  discussion of the next release 1.19.
> > 
> >  - Release managers
> > 
> >  Yun Tang and I would like to volunteer as release managers for 1.19,
> > >> and
> >  it would be great to have someone else working together on this
> > >> release.
> >  Please let us know if you have any interest!
> > 
> >  - Timeline
> > 
> >  Flink 1.18 has been released. With a target release cycle of 4
> months,
> > >> we
> >  propose a feature freeze date of *Jan 26, 2024*.
> > 
> >  - Collecting features
> > 
> >  As usual, we've created a wiki page[1] for collecting new features
> in
> > >>> 1.19.
> >  In addition, we already have a number of FLIPs that have been voted
> or
> > >>> are
> >  in
> >  the process, including pre-works for version 2.0.
> >  In the meantime, the release management team will be finalized in
> the
> > >>> next
> >  few days,
> >  and we'll continue to create Jira Boards and Sync meetings to make
> it
> > >>> easy
> >  for
> >  everyone to get an overview and track progress.
> > 
> >  1. https://cwiki.apache.org/confluence/display/FLINK/1.19+Release
> > 
> >  Best,
> >  Lincoln Lee
> > 
> > >>>
> > >>
> >
> >
>


Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-27 Thread Jark Wu
Hi Becket,

I checked the history of "
*table.optimizer.source.predicate-pushdown-enabled*",
it seems it was introduced since the legacy FilterableTableSource interface
which might be an experiential feature at that time. I don't see the
necessity
of this option at the moment. Maybe we can deprecate this option and drop
it
in Flink 2.0[1] if it is not necessary anymore. This may help to
simplify this discussion.


Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-32383



On Thu, 26 Oct 2023 at 10:14, Becket Qin  wrote:

> Thanks for the proposal, Jiabao. My two cents below:
>
> 1. If I understand correctly, the motivation of the FLIP is mainly to make
> predicate pushdown optional on SOME of the Sources. If so, intuitively the
> configuration should be Source specific instead of general. Otherwise, we
> will end up with general configurations that may not take effect for some
> of the Source implementations. This violates the basic rule of a
> configuration - it does what it says, regardless of the implementation.
> While configuration standardization is usually a good thing, it should not
> break the basic rules.
> If we really want to have this general configuration, for the sources this
> configuration does not apply, they should throw an exception to make it
> clear that this configuration is not supported. However, that seems ugly.
>
> 2. I think the actual motivation of this FLIP is about "how a source
> should implement predicate pushdown efficiently", not "whether predicate
> pushdown should be applied to the source." For example, if a source wants
> to avoid additional computing load in the external system, it can always
> read the entire record and apply the predicates by itself. However, from
> the Flink perspective, the predicate pushdown is applied, it is just
> implemented differently by the source. So the design principle here is that
> Flink only cares about whether a source supports predicate pushdown or not,
> it does not care about the implementation efficiency / side effect of the
> predicates pushdown. It is the Source implementation's responsibility to
> ensure the predicates pushdown is implemented efficiently and does not
> impose excessive pressure on the external system. And it is OK to have
> additional configurations to achieve this goal. Obviously, such
> configurations will be source specific in this case.
>
> 3. Regarding the existing configuration
> *table.optimizer.source.predicate-pushdown-enabled*,
> I am not sure why we need it. Supposedly, if a source implements a
> SupportsXXXPushDown interface, the optimizer should push the corresponding
> predicates to the Source. I am not sure in which case this configuration
> would be used. Any ideas @Jark Wu ?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Wed, Oct 25, 2023 at 11:55 PM Jiabao Sun
>  wrote:
>
>> Thanks Jane for the detailed explanation.
>>
>> I think that for users, we should respect conventions over
>> configurations.
>> Conventions can be default values explicitly specified in configurations,
>> or they can be behaviors that follow previous versions.
>> If the same code has different behaviors in different versions, it would
>> be a very bad thing.
>>
>> I agree that for regular users, it is not necessary to understand all the
>> configurations related to Flink.
>> By following conventions, they can have a good experience.
>>
>> Let's get back to the practical situation and consider it.
>>
>> Case 1:
>> The user is not familiar with the purpose of the
>> table.optimizer.source.predicate-pushdown-enabled configuration but follows
>> the convention of allowing predicate pushdown to the source by default.
>> Just understanding the source.predicate-pushdown-enabled configuration
>> and performing fine-grained toggle control will work well.
>>
>> Case 2:
>> The user understands the meaning of the
>> table.optimizer.source.predicate-pushdown-enabled configuration and has set
>> its value to false.
>> We have reason to believe that the user understands the meaning of the
>> predicate pushdown configuration and the intention is to disable predicate
>> pushdown (rather than whether or not to allow it).
>> The previous choice of globally disabling it is likely because it
>> couldn't be disabled on individual sources.
>> From this perspective, if we provide more fine-grained configuration
>> support and provide detailed explanations of the configuration behaviors in
>> the documentation,
>> users can clearly understand the differences between these two
>> configurations and use them correctly.
>>
>

Re: [DISCUSS] FLIP-376: Add DISTRIBUTED BY clause

2023-10-27 Thread Jark Wu
Hi Timo,

Thanks for starting this discussion. I really like it!
The FLIP is already in good shape, I only have some minor comments.

1. Could we also support HASH and RANGE distribution kind on the DDL
syntax?
I noticed that HASH and UNKNOWN are introduced in the Java API, but not in
the syntax.

2. Can we make "INTO n BUCKETS" optional in CREATE TABLE and ALTER TABLE?
Some storage engines support automatically determining the bucket number
based on
the cluster resources and data size of the table. For example, StarRocks[1]
and Paimon[2].
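
To make both points concrete, these are the kinds of statements I have in
mind (sketches of the proposed syntax only — none of this is supported by
Flink today, the connector options are illustrative, and `tableEnv` is an
existing TableEnvironment):

    // 1. explicit algorithm plus bucket count, as already shown in the FLIP
    tableEnv.executeSql(
        "CREATE TABLE t1 (uid BIGINT, name STRING) " +
        "DISTRIBUTED BY HASH(uid) INTO 6 BUCKETS " +
        "WITH ('connector' = 'kafka')");

    // 2. explicit RANGE algorithm with the bucket count omitted,
    //    leaving the number of buckets to the connector
    tableEnv.executeSql(
        "CREATE TABLE t2 (uid BIGINT, name STRING) " +
        "DISTRIBUTED BY RANGE(uid) " +
        "WITH ('connector' = 'paimon')");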

Best,
Jark

[1]:
https://docs.starrocks.io/en-us/latest/table_design/Data_distribution#determine-the-number-of-buckets
[2]:
https://paimon.apache.org/docs/0.5/concepts/primary-key-table/#dynamic-bucket

On Thu, 26 Oct 2023 at 18:26, Jingsong Li  wrote:

> Very thanks Timo for starting this discussion.
>
> Big +1 for this.
>
> The design looks good to me!
>
> We can add some documentation for connector developers. For example:
> for sink, If there needs some keyby, please finish the keyby by the
> connector itself. SupportsBucketing is just a marker interface.
>
> Best,
> Jingsong
>
> On Thu, Oct 26, 2023 at 5:00 PM Timo Walther  wrote:
> >
> > Hi everyone,
> >
> > I would like to start a discussion on FLIP-376: Add DISTRIBUTED BY
> > clause [1].
> >
> > Many SQL vendors expose the concepts of Partitioning, Bucketing, and
> > Clustering. This FLIP continues the work of previous FLIPs and would
> > like to introduce the concept of "Bucketing" to Flink.
> >
> > This is a pure connector characteristic and helps both Apache Kafka and
> > Apache Paimon connectors in avoiding a complex WITH clause by providing
> > improved syntax.
> >
> > Here is an example:
> >
> > CREATE TABLE MyTable
> >(
> >  uid BIGINT,
> >  name STRING
> >)
> >DISTRIBUTED BY (uid) INTO 6 BUCKETS
> >WITH (
> >  'connector' = 'kafka'
> >)
> >
> > The full syntax specification can be found in the document. The clause
> > should be optional and fully backwards compatible.
> >
> > Regards,
> > Timo
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-376%3A+Add+DISTRIBUTED+BY+clause
>


Re: [ANNOUNCE] Apache Flink 1.18.0 released

2023-10-26 Thread Jark Wu
Congratulations, and thanks to the release managers and everyone who has
contributed!

Best,
Jark

On Fri, 27 Oct 2023 at 12:25, Hang Ruan  wrote:

> Congratulations!
>
> Best,
> Hang
>
> On Fri, 27 Oct 2023 at 11:50, Samrat Deb wrote:
>
> > Congratulations on the great release
> >
> > Bests,
> > Samrat
> >
> > On Fri, 27 Oct 2023 at 7:59 AM, Yangze Guo  wrote:
> >
> > > Great work! Congratulations to everyone involved!
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Fri, Oct 27, 2023 at 10:23 AM Qingsheng Ren 
> wrote:
> > > >
> > > > Congratulations and big THANK YOU to everyone helping with this
> > release!
> > > >
> > > > Best,
> > > > Qingsheng
> > > >
> > > > On Fri, Oct 27, 2023 at 10:18 AM Benchao Li 
> > > wrote:
> > > >>
> > > >> Great work, thanks everyone involved!
> > > >>
> > > >> On Fri, 27 Oct 2023 at 10:16, Rui Fan <1996fan...@gmail.com> wrote:
> > > >> >
> > > >> > Thanks for the great work!
> > > >> >
> > > >> > Best,
> > > >> > Rui
> > > >> >
> > > >> > On Fri, Oct 27, 2023 at 10:03 AM Paul Lam 
> > > wrote:
> > > >> >
> > > >> > > Finally! Thanks to all!
> > > >> > >
> > > >> > > Best,
> > > >> > > Paul Lam
> > > >> > >
> > > >> > > > On Fri, 27 Oct 2023 at 03:58, Alexander Fedulov  wrote:
> > > >> > > >
> > > >> > > > Great work, thanks everyone!
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Alexander
> > > >> > > >
> > > >> > > > On Thu, 26 Oct 2023 at 21:15, Martijn Visser <
> > > martijnvis...@apache.org>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > >> Thank you all who have contributed!
> > > >> > > >>
> > > >> > > >> Op do 26 okt 2023 om 18:41 schreef Feng Jin <
> > > jinfeng1...@gmail.com>
> > > >> > > >>
> > > >> > > >>> Thanks for the great work! Congratulations
> > > >> > > >>>
> > > >> > > >>>
> > > >> > > >>> Best,
> > > >> > > >>> Feng Jin
> > > >> > > >>>
> > > >> > > >>> On Fri, Oct 27, 2023 at 12:36 AM Leonard Xu <
> > xbjt...@gmail.com>
> > > wrote:
> > > >> > > >>>
> > > >> > >  Congratulations, Well done!
> > > >> > > 
> > > >> > >  Best,
> > > >> > >  Leonard
> > > >> > > 
> > > >> > >  On Fri, Oct 27, 2023 at 12:23 AM Lincoln Lee <
> > > lincoln.8...@gmail.com>
> > > >> > >  wrote:
> > > >> > > 
> > > >> > > > Thanks for the great work! Congrats all!
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Lincoln Lee
> > > >> > > >
> > > >> > > >
> > > >> > > > On Fri, 27 Oct 2023 at 00:16, Jing Ge wrote:
> > > >> > > >
> > > >> > > >> The Apache Flink community is very happy to announce the
> > > release of
> > > >> > > > Apache
> > > >> > > >> Flink 1.18.0, which is the first release for the Apache
> > > Flink 1.18
> > > >> > > > series.
> > > >> > > >>
> > > >> > > >> Apache Flink® is an open-source unified stream and batch
> > data
> > > >> > >  processing
> > > >> > > >> framework for distributed, high-performing,
> > > always-available, and
> > > >> > > > accurate
> > > >> > > >> data applications.
> > > >> > > >>
> > > >> > > >> The release is available for download at:
> > > >> > > >> https://flink.apache.org/downloads.html
> > > >> > > >>
> > > >> > > >> Please check out the release blog post for an overview of
> > the
> > > >> > > > improvements
> > > >> > > >> for this release:
> > > >> > > >>
> > > >> > > >>
> > > >> > > >
> > > >> > > 
> > > >> > > >>>
> > > >> > > >>
> > > >> > >
> > >
> >
> https://flink.apache.org/2023/10/24/announcing-the-release-of-apache-flink-1.18/
> > > >> > > >>
> > > >> > > >> The full release notes are available in Jira:
> > > >> > > >>
> > > >> > > >>
> > > >> > > >
> > > >> > > 
> > > >> > > >>>
> > > >> > > >>
> > > >> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352885
> > > >> > > >>
> > > >> > > >> We would like to thank all contributors of the Apache
> Flink
> > > >> > > >> community
> > > >> > >  who
> > > >> > > >> made this release possible!
> > > >> > > >>
> > > >> > > >> Best regards,
> > > >> > > >> Konstantin, Qingsheng, Sergey, and Jing
> > > >> > > >>
> > > >> > > >
> > > >> > > 
> > > >> > > >>>
> > > >> > > >>
> > > >> > >
> > > >> > >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> Best,
> > > >> Benchao Li
> > >
> >
>


Re: [VOTE] FLIP-373: Support Configuring Different State TTLs using SQL Hint

2023-10-25 Thread Jark Wu
+1 (binding)

Best,
Jark

On Wed, 25 Oct 2023 at 16:27, Jiabao Sun 
wrote:

> Thanks Jane for driving this.
>
> +1 (non-binding)
>
> Best,
> Jiabao
>
>
> > On Wed, 25 Oct 2023 at 16:22, Lincoln Lee wrote:
> >
> > +1 (binding)
> >
> > Best,
> > Lincoln Lee
> >
> >
> > On Mon, 23 Oct 2023 at 14:15, Zakelly Lan wrote:
> >
> >> +1(non-binding)
> >>
> >> Best,
> >> Zakelly
> >>
> >> On Mon, Oct 23, 2023 at 1:15 PM Benchao Li 
> wrote:
> >>>
> >>> +1 (binding)
> >>>
> >>> On Mon, 23 Oct 2023 at 13:07, Feng Jin wrote:
> 
>  +1(non-binding)
> 
> 
>  Best,
>  Feng
> 
> 
>  On Mon, Oct 23, 2023 at 11:58 AM Xuyang  wrote:
> 
> > +1(non-binding)
> >
> >
> >
> >
> > --
> >
> >Best!
> >Xuyang
> >
> >
> >
> >
> >
> > At 2023-10-23 11:38:15, "Jane Chan"  wrote:
> >> Hi developers,
> >>
> >> Thanks for all the feedback on FLIP-373: Support Configuring
> >> Different
> >> State TTLs using SQL Hint [1].
> >> Based on the discussion [2], we have reached a consensus, so I'd
> >> like to
> >> start a vote.
> >>
> >> The vote will last for at least 72 hours (Oct. 26th at 10:00 A.M.
> >> GMT)
> >> unless there is an objection or insufficient votes.
> >>
> >> [1]
> >>
> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-373%3A+Support+Configuring+Different+State+TTLs+using+SQL+Hint
> >> [2]
> >> https://lists.apache.org/thread/3s69dhv3rp4s0kysnslqbvyqo3qf7zq5
> >>
> >> Best,
> >> Jane
> >
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>> Best,
> >>> Benchao Li
> >>
>
>


Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-24 Thread Jark Wu
Thank you for updating Jiabao,

The FLIP looks good to me.

Best,
Jark

On Wed, 25 Oct 2023 at 00:42, Jiabao Sun 
wrote:

> Thanks Jane for the feedback.
>
> The default value of "table.optimizer.source.predicate" is true, which
> means that by default
> predicate pushdown is permitted for all sources.
>
> Therefore, disabling filter pushdown for individual sources can take
> effect.
>
>
> Best,
> Jiabao
>
>
> > On Tue, 24 Oct 2023 at 23:52, Jane Chan wrote:
> >
> >>
> >> I believe that the configuration "table.optimizer.source.predicate" has
> a
> >> higher priority at the planner level than the configuration at the
> source
> >> level,
> >> and it seems easy to implement now.
> >>
> >
> > Correct me if I'm wrong, but I think the fine-grained configuration
> > "scan.filter-push-down.enabled" should have a higher priority because the
> > default value of "table.optimizer.source.predicate" is true. As a result,
> > turning off filter push-down for a specific source will not take effect
> > unless the default value of "table.optimizer.source.predicate" is changed
> > to false, or, alternatively, let users manually set
> > "table.optimizer.source.predicate" to false first and then selectively
> > enable filter push-down for the desired sources, which is less intuitive.
> > WDYT?
> >
> > Best,
> > Jane
> >
> > On Tue, Oct 24, 2023 at 6:05 PM Jiabao Sun wrote:
> >
> >> Thanks Jane,
> >>
> >> I believe that the configuration "table.optimizer.source.predicate" has
> a
> >> higher priority at the planner level than the configuration at the
> source
> >> level,
> >> and it seems easy to implement now.
> >>
> >> Best,
> >> Jiabao
> >>
> >>
> >>> On Tue, 24 Oct 2023 at 17:36, Jane Chan wrote:
> >>>
> >>> Hi Jiabao,
> >>>
> >>> Thanks for driving this discussion. I have a small question that will
> >>> "scan.filter-push-down.enabled" take precedence over
> >>> "table.optimizer.source.predicate" when the two parameters might
> conflict
> >>> each other?
> >>>
> >>> Best,
> >>> Jane
> >>>
> >>> On Tue, Oct 24, 2023 at 5:05 PM Jiabao Sun wrote:
> >>>
> >>>> Thanks Jark,
> >>>>
> >>>> If we only add configuration without adding the enableFilterPushDown
> >>>> method in the SupportsFilterPushDown interface,
> >>>> each connector would have to handle the same logic in the applyFilters
> >>>> method to determine whether filter pushdown is needed.
> >>>> This would increase complexity and violate the original behavior of
> the
> >>>> applyFilters method.
> >>>>
> >>>> On the contrary, we only need to pass the configuration parameter in
> the
> >>>> newly added enableFilterPushDown method
> >>>> to decide whether to perform predicate pushdown.
> >>>>
> >>>> I think this approach would be clearer and simpler.
> >>>> WDYT?
> >>>>
> >>>> Best,
> >>>> Jiabao
> >>>>
> >>>>
> >>>>> 2023年10月24日 16:58,Jark Wu  写道:
> >>>>>
> >>>>> Hi JIabao,
> >>>>>
> >>>>> I think the current interface can already satisfy your requirements.
> >>>>> The connector can reject all the filters by returning the input
> filters
> >>>>> as `Result#remainingFilters`.
> >>>>>
> >>>>> So maybe we don't need to introduce a new method to disable
> >>>>> pushdown, but just introduce an option for the specific connector.
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
> >>>>> On Tue, 24 Oct 2023 at 16:38, Leonard Xu  wrote:
> >>>>>
> >>>>>> Thanks @Jiabao for kicking off this discussion.
> >>>>>>
> >>>>>> Could you add a section to explain the difference between proposed
> >>>>>> connector level config `scan.filter-push-down.enabled` and existing
> >>>> query
> >>>>>> level config `table.optimizer.source.predicate-pushdow

Re: [DISCUSS] FLIP-377: Support configuration to disable filter push down for Table/SQL Sources

2023-10-24 Thread Jark Wu
Hi JIabao,

I think the current interface can already satisfy your requirements.
The connector can reject all the filters by returning the input filters
as `Result#remainingFilters`.

So maybe we don't need to introduce a new method to disable
pushdown, but just introduce an option for the specific connector.
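
For example, a minimal sketch of a connector honoring such an option (the
option name and class name are illustrative; only SupportsFilterPushDown and
its Result type are existing API):

    import java.util.Collections;
    import java.util.List;
    import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
    import org.apache.flink.table.expressions.ResolvedExpression;

    public class MyTableSource implements SupportsFilterPushDown /* plus ScanTableSource, ... */ {
        // e.g. populated from a connector option such as 'scan.filter-push-down.enabled'
        private final boolean filterPushDownEnabled;

        public MyTableSource(boolean filterPushDownEnabled) {
            this.filterPushDownEnabled = filterPushDownEnabled;
        }

        @Override
        public Result applyFilters(List<ResolvedExpression> filters) {
            if (!filterPushDownEnabled) {
                // accept nothing; all filters stay in the plan as remaining filters
                return Result.of(Collections.emptyList(), filters);
            }
            // conservative default: consume the filters but also keep them as remaining
            return Result.of(filters, filters);
        }
    }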

Best,
Jark

On Tue, 24 Oct 2023 at 16:38, Leonard Xu  wrote:

> Thanks @Jiabao for kicking off this discussion.
>
> Could you add a section to explain the difference between proposed
> connector level config `scan.filter-push-down.enabled` and existing query
> level config `table.optimizer.source.predicate-pushdown-enabled` ?
>
> Best,
> Leonard
>
> > On Tue, 24 Oct 2023 at 4:18 PM, Jiabao Sun wrote:
> >
> > Hi Devs,
> >
> > I would like to start a discussion on FLIP-377: support configuration to
> disable filter pushdown for Table/SQL Sources[1].
> >
> > Currently, Flink Table/SQL does not expose fine-grained control for
> users to enable or disable filter pushdown.
> > However, filter pushdown has some side effects, such as additional
> computational pressure on external systems.
> > Moreover, improper queries can lead to issues such as full table scans,
> which in turn can impact the stability of external systems.
> >
> > Suppose we have an SQL query with two sources: Kafka and a database.
> > The database is sensitive to pressure, and we want to configure it to
> not perform filter pushdown to the database source.
> > However, we still want to perform filter pushdown to the Kafka source to
> decrease network IO.
> >
> > I propose to support a configuration to disable filter pushdown for
> Table/SQL sources to let users decide whether to perform filter pushdown.
> >
> > Looking forward to your feedback.
> >
> > [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=276105768
> >
> > Best,
> > Jiabao
>
>


[ANNOUNCE] New Apache Flink Committer - Jane Chan

2023-10-15 Thread Jark Wu
Hi, everyone

On behalf of the PMC, I'm very happy to announce Jane Chan as a new Flink
Committer.

Jane started code contribution in Jan 2021 and has been active in the Flink
community since. She authored more than 60 PRs and reviewed more than 40
PRs. Her contribution mainly revolves around Flink SQL, including Plan
Advice (FLIP-280), operator-level state TTL (FLIP-292), and ALTER TABLE
statements (FLINK-21634). Jane participated deeply in development
discussions and also helped answer user questions on the mailing lists. Jane
was also a core contributor to Flink Table Store (now Paimon) in the
project's early days.

Please join me in congratulating Jane Chan for becoming a Flink Committer!

Best,
Jark Wu (on behalf of the Flink PMC)


[ANNOUNCE] New Apache Flink Committer - Ron Liu

2023-10-15 Thread Jark Wu
Hi, everyone

On behalf of the PMC, I'm very happy to announce Ron Liu as a new Flink
Committer.

Ron has been continuously contributing to the Flink project for many years,
and has authored and reviewed a lot of code. He mainly works on the Flink SQL
parts and drove several important FLIPs, e.g., USING JAR (FLIP-214), Operator
Fusion CodeGen (FLIP-315), and Runtime Filter (FLIP-324). He has great
knowledge of batch SQL and has improved batch performance a lot over the
past several releases. He is also quite active on the mailing lists,
participating in discussions and answering user questions.

Please join me in congratulating Ron Liu for becoming a Flink Committer!

Best,
Jark Wu (on behalf of the Flink PMC)


Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-18 Thread Jark Wu
Hi Dong,

Sorry for the late reply.

> The rationale is that if there is any strategy that is triggered and says
> backlog=true, then job's backlog should be true. Otherwise, the job's
> backlog status is false.

I'm quite confused about this. Does that mean, if the source is in the
changelog phase, the source has to continuously invoke
"setIsProcessingBacklog(true)" (in an infinite loop?). Otherwise,
the job's backlog status would be set to false by the framework?
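
For reference, this is my understanding of the FLIP-309 contract in a
two-phase (CDC-like) source — a sketch with illustrative method names,
inside a SplitEnumerator that holds a SplitEnumeratorContext `context`:

    private void enterSnapshotPhase() {
        // invoked once when the snapshot phase starts
        context.setIsProcessingBacklog(true);
    }

    private void enterChangelogPhase() {
        // invoked once when switching to the changelog phase
        context.setIsProcessingBacklog(false);
    }

My question above is whether such one-shot invocations are enough under the
proposed strategy, or whether the watermark-lag rule would override them.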

Best,
Jark

On Tue, 19 Sept 2023 at 09:13, Dong Lin  wrote:

> Hi Jark,
>
> Do you have time to comment on whether the current design looks good?
>
> I plan to start voting in 3 days if there is no follow-up comment.
>
> Thanks,
> Dong
>
>
> On Fri, Sep 15, 2023 at 2:01 PM Jark Wu  wrote:
>
> > Hi Dong,
> >
> > > Note that we can not simply enforce the semantics of "any invocation of
> > > setIsProcessingBacklog(false) will set the job's backlog status to
> > false".
> > > Suppose we have a job with two operators, where operatorA invokes
> > > setIsProcessingBacklog(false) and operatorB invokes
> > > setIsProcessingBacklog(true). There will be conflict if we use the
> > > semantics of "any invocation of setIsProcessingBacklog(false) will set
> > the
> > > job's backlog status to false".
> >
> > So it should set job's backlog status to false if the job has only a
> single
> > source,
> > right?
> >
> > Could you elaborate on the behavior if there is a job with a single
> source,
> > and the watermark lag exceeds the configured value (should set backlog to
> > true?),
> > but the source invokes "setIsProcessingBacklog(false)"? Or the inverse
> one,
> > the source invokes "setIsProcessingBacklog(false)" first, but the
> watermark
> > lag
> > exceeds the configured value.
> >
> > This is the conflict I'm concerned about.
> >
> > Best,
> > Jark
> >
> > On Fri, 15 Sept 2023 at 12:00, Dong Lin  wrote:
> >
> > > Hi Jark,
> > >
> > > Please see my comments inline.
> > >
> > > On Fri, Sep 15, 2023 at 10:35 AM Jark Wu  wrote:
> > >
> > > > Hi Dong,
> > > >
> > > > Please see my comments inline below.
> > >
> > >
> > > > > Hmm.. can you explain what you mean by "different watermark delay
> > > > > definitions for each source"?
> > > >
> > > > For example, "table1" defines a watermark with delay 5 seconds,
> > > > "table2" defines a watermark with delay 10 seconds. They have
> different
> > > > watermark delay definitions. So it is also reasonable they have
> > different
> > > > watermark lag definitions, e.g., "table1" allows "10mins" and
> "table2"
> > > > allows "20mins".
> > > >
> > >
> > > I think the watermark delay you mentioned above is conceptually /
> > > fundamentally different from the watermark-lag-threshold proposed in
> this
> > > FLIP.
> > >
> > > It might be useful to revisit the semantics of these two concepts:
> > > - watermark delay is used to account for the maximum amount of
> > orderliness
> > > that users expect (or willing to wait for) for records from a given
> > source.
> > > - watermark-lag-threshold is used to define when processing latency is
> no
> > > longer important (e.g. because data is already stale).
> > >
> > > Even though users might expect different out of orderliness for
> different
> > > sources, users do not necessarily have different definitions /
> thresholds
> > > for when a record is considered "already stale".
> > >
> > >
> > > >
> > > > > I think there is probably misunderstanding here. FLIP-309 does NOT
> > > > directly
> > > > > specify when backlog is false. It is intentionally specified in
> such
> > a
> > > > way
> > > > > that there will  not be any conflict between these rules.
> > > >
> > > > Do you mean FLIP-309 doesn't allow to specify backlog to be false?
> > > > Is this mentioned in FLIP-309? This is completely different from
> what I
> > > >
> > >
> > > Can you explain what you mean by "allow to specify backlog to be
> false"?
> > >
> > > If what you mean is that "can invoke se

Re: [DISCUSS] [FLINK-32873] Add a config to allow disabling Query hints

2023-09-15 Thread Jark Wu
Hi Martijn,

Thanks for the investigation. I found the blog [1] shows a use case
where "OPTIMIZER_IGNORE_HINTS" is used to check whether embedded
hints can be removed from legacy code. I think this is a useful tool to
improve queries without complex hints strewn throughout the code.
Therefore, I'm fine with supporting this now.

Maybe we can follow Oracle and name the config
"table.optimizer.ignore-query-hints=false"?

Best,
Jark

[1]: https://dbaora.com/optimizer_ignore_hints-oracle-database-18c/

On Fri, 15 Sept 2023 at 17:57, Martijn Visser 
wrote:

> Hi Jark,
>
> Oracle has/had a generic "OPTIMIZER_IGNORE_HINTS" [1]. It looks like MSSQL
> has configuration options to disable specific hints [2] instead of a
> generic solution.
>
> [1]
>
> https://docs.oracle.com/en/database/oracle/oracle-database/23/refrn/OPTIMIZER_IGNORE_HINTS.html#GUID-D62CA6D8-D0D8-4A20-93EA-EEB4B3144347
> [2]
>
> https://www.mssqltips.com/sqlservertip/4175/disabling-sql-server-optimizer-rules-with-queryruleoff/
>
> Best regards,
>
> Martijn
>
> On Fri, Sep 15, 2023 at 10:53 AM Jark Wu  wrote:
>
> > Hi Martijn,
> >
> > I can understand that.
> > Is there any database/system that supports disabling/enabling query hints
> >  with a configuration? This might help us to understand the use case,
> > and follow the approach.
> >
> > Best,
> > Jark
> >
> > On Fri, 15 Sept 2023 at 15:34, Martijn Visser 
> > wrote:
> >
> > > Hi all,
> > >
> > > I think Jark has a valid point with:
> > >
> > > > Does this mean that in the future we might add an option to disable
> > each
> > > feature?
> > >
> > > I don't think that's a reasonable outcome indeed, but we are currently
> > in a
> > > situation where we don't have clear guidelines on when to add a
> > > configuration option, and when not to add one. From a platform
> > perspective,
> > > there might not be an imminent or obvious security implication, but you
> > > want to minimize a potential attack surface.
> > >
> > > > We should try to remove the unnecessary configuration from the list
> in
> > > Flink 2.0.
> > >
> > > I agree with that too.
> > >
> > > With these things in mind, my proposal would be the following:
> > >
> > > * Add a configuration option for this situation, given that we don't
> have
> > > clear guidelines on when to add/not add a new config option.
> > > * Since we want to overhaul the configuration layer anyway in Flink
> 2.0,
> > we
> > > clean-up the configuration list by not having an option for each item,
> > but
> > > either a generic option that allows you to disable one or more features
> > (by
> > > providing a list as the configuration option), or we already bundle
> > > multiple configuration options into a specific category, e.g. so that
> you
> > > can have a default Flink without any hardening, a read-only Flink, a
> > > fully-hardened Flink etc)
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > >
> > > On Mon, Sep 11, 2023 at 4:51 PM Jim Hughes
>  > >
> > > wrote:
> > >
> > > > Hi Jing and Jark!
> > > >
> > > > I can definitely appreciate the desire to have fewer configurations.
> > > >
> > > > Do you have a suggested alternative for platform providers to limit
> or
> > > > restrict the hints that Bonnie is talking about?
> > > >
> > > > As one possibility, maybe one configuration could be set to control
> all
> > > > hints.
> > > >
> > > > Cheers,
> > > >
> > > > Jim
> > > >
> > > > On Sat, Sep 9, 2023 at 6:16 AM Jark Wu  wrote:
> > > >
> > > > > I agree with Jing,
> > > > >
> > > > > My biggest concern is this makes the boundary of adding an option
> > very
> > > > > unclear.
> > > > > It's not a strong reason to add a config just because it doesn't
> > > > affect
> > > > > existing
> > > > > users. Does this mean that in the future we might add an option to
> > > > disable
> > > > > each feature?
> > > > >
> > > > > Flink already has a very long list of configurations [1][2] and
> this
> > is
> > > > > very scary
> > > > > and not easy to use. We should try

Re: [DISCUSS] [FLINK-32873] Add a config to allow disabling Query hints

2023-09-15 Thread Jark Wu
Hi Martijn,

I can understand that.
Is there any database/system that supports disabling/enabling query hints
 with a configuration? This might help us to understand the use case,
and follow the approach.

Best,
Jark

On Fri, 15 Sept 2023 at 15:34, Martijn Visser 
wrote:

> Hi all,
>
> I think Jark has a valid point with:
>
> > Does this mean that in the future we might add an option to disable each
> feature?
>
> I don't think that's a reasonable outcome indeed, but we are currently in a
> situation where we don't have clear guidelines on when to add a
> configuration option, and when not to add one. From a platform perspective,
> there might not be an imminent or obvious security implication, but you
> want to minimize a potential attack surface.
>
> > We should try to remove the unnecessary configuration from the list in
> Flink 2.0.
>
> I agree with that too.
>
> With these things in mind, my proposal would be the following:
>
> * Add a configuration option for this situation, given that we don't have
> clear guidelines on when to add/not add a new config option.
> * Since we want to overhaul the configuration layer anyway in Flink 2.0, we
> clean-up the configuration list by not having an option for each item, but
> either a generic option that allows you to disable one or more features (by
> providing a list as the configuration option), or we already bundle
> multiple configuration options into a specific category, e.g. so that you
> can have a default Flink without any hardening, a read-only Flink, a
> fully-hardened Flink etc)
>
> Best regards,
>
> Martijn
>
>
> On Mon, Sep 11, 2023 at 4:51 PM Jim Hughes 
> wrote:
>
> > Hi Jing and Jark!
> >
> > I can definitely appreciate the desire to have fewer configurations.
> >
> > Do you have a suggested alternative for platform providers to limit or
> > restrict the hints that Bonnie is talking about?
> >
> > As one possibility, maybe one configuration could be set to control all
> > hints.
> >
> > Cheers,
> >
> > Jim
> >
> > On Sat, Sep 9, 2023 at 6:16 AM Jark Wu  wrote:
> >
> > > I agree with Jing,
> > >
> > > My biggest concern is this makes the boundary of adding an option very
> > > unclear.
> > > It's not a strong reason to add a config just because it doesn't
> > affect
> > > existing
> > > users. Does this mean that in the future we might add an option to
> > disable
> > > each feature?
> > >
> > > Flink already has a very long list of configurations [1][2] and this is
> > > very scary
> > > and not easy to use. We should try to remove the unnecessary
> > configuration
> > > from
> > > the list in Flink 2.0. However, from my perspective, adding this option
> > > moves us further
> > > away from that direction.
> > >
> > > Best,
> > > Jark
> > >
> > > [1]
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/
> > > [2]
> > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/
> > >
> > > On Sat, 9 Sept 2023 at 17:33, Jing Ge 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > Thanks for bringing this to our attention. At the first glance, it
> > looks
> > > > reasonable to offer a new configuration to enable/disable SQL hints
> > > > globally. However, IMHO, it is not the right timing to do it now,
> > because
> > > > we should not only think as platform providers but also as end
> > > > users (the
> > > > number of end users is much bigger than that of platform providers):
> > > >
> > > > 1. Users don't need it because users have the choice to use hints or
> > not,
> > > > just like Jark pointed out. With this configuration, there will be a
> > > fight
> > > > between platform providers and users which will cause more confusions
> > and
> > > > conflicts. And users will probably win, IMHO, because they are the
> end
> > > > customers that use Flink to create business values.
> > > > 2. SQL hints could be considered as an additional feature for users
> to
> > > > control, to optimize the execution plan without touching the internal
> > > > logic, i.e. features for advanced use cases and i.e. don't use it if
> > you
> > > > don't understand it.
> > > > 3. Before the system is sma

Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-14 Thread Jark Wu
Hi Dong,

> Note that we can not simply enforce the semantics of "any invocation of
> setIsProcessingBacklog(false) will set the job's backlog status to false".
> Suppose we have a job with two operators, where operatorA invokes
> setIsProcessingBacklog(false) and operatorB invokes
> setIsProcessingBacklog(true). There will be conflict if we use the
> semantics of "any invocation of setIsProcessingBacklog(false) will set the
> job's backlog status to false".

So it should set job's backlog status to false if the job has only a single
source,
right?

Could you elaborate on the behavior if there is a job with a single source,
and the watermark lag exceeds the configured value (should set backlog to
true?),
but the source invokes "setIsProcessingBacklog(false)"? Or the inverse one,
the source invokes "setIsProcessingBacklog(false)" first, but the watermark
lag
exceeds the configured value.

This is the conflict I'm concerned about.

Best,
Jark

On Fri, 15 Sept 2023 at 12:00, Dong Lin  wrote:

> Hi Jark,
>
> Please see my comments inline.
>
> On Fri, Sep 15, 2023 at 10:35 AM Jark Wu  wrote:
>
> > Hi Dong,
> >
> > Please see my comments inline below.
>
>
> > > Hmm.. can you explain what you mean by "different watermark delay
> > > definitions for each source"?
> >
> > For example, "table1" defines a watermark with delay 5 seconds,
> > "table2" defines a watermark with delay 10 seconds. They have different
> > watermark delay definitions. So it is also reasonable they have different
> > watermark lag definitions, e.g., "table1" allows "10mins" and "table2"
> > allows "20mins".
> >
>
> I think the watermark delay you mentioned above is conceptually /
> fundamentally different from the watermark-lag-threshold proposed in this
> FLIP.
>
> It might be useful to revisit the semantics of these two concepts:
> - watermark delay is used to account for the maximum amount of orderliness
> that users expect (or willing to wait for) for records from a given source.
> - watermark-lag-threshold is used to define when processing latency is no
> longer important (e.g. because data is already stale).
>
> Even though users might expect different out of orderliness for different
> sources, users do not necessarily have different definitions / thresholds
> for when a record is considered "already stale".
>
>
> >
> > > I think there is probably misunderstanding here. FLIP-309 does NOT
> > directly
> > > specify when backlog is false. It is intentionally specified in such a
> > way
> > > that there will  not be any conflict between these rules.
> >
> > Do you mean FLIP-309 doesn't allow to specify backlog to be false?
> > Is this mentioned in FLIP-309? This is completely different from what I
> >
>
> Can you explain what you mean by "allow to specify backlog to be false"?
>
> If what you mean is that "can invoke setIsProcessingBacklog(false)", then
> FLIP-309 supports doing this.
>
> If what you mean is that "any invocation of setIsProcessingBacklog(false)
> will set the job's backlog status to false", then FLIP-309 does not support
> this. I believe the existing Java doc of this API and FLIP-309 is
> compatible with this explanation.
>
> Note that we can not simply enforce the semantics of "any invocation of
> setIsProcessingBacklog(false) will set the job's backlog status to false".
> Suppose we have a job with two operators, where operatorA invokes
> setIsProcessingBacklog(false) and operatorB invokes
> setIsProcessingBacklog(true). There will be conflict if we use the
> semantics of "any invocation of setIsProcessingBacklog(false) will set the
> job's backlog status to false".
>
> Would this answer your question?
>
> Best,
> Dong
>
>
> > understand. From the API interface "ctx.setIsProcessingBacklog(boolean)",
> > it allows users to invoke "setIsProcessingBacklog(false)". And FLIP-309
> > also says "MySQL CDC source should report isProcessingBacklog=false
> > at the beginning of the changelog stage." If not, maybe we need to
> revisit
> > FLIP-309.
>
>
> > Best,
> > Jark
> >
> >
> >
> > On Fri, 15 Sept 2023 at 08:41, Dong Lin  wrote:
> >
> > > Hi Jark,
> > >
> > > Do you have any follow-up comment?
> > >
> > > My gut feeling is that suppose we need to support per-source watermark
> > lag
> > > specification in the future (not sure we have a use

Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-14 Thread Jark Wu
Hi Dong,

Please see my comments inline below.

> Hmm.. can you explain what you mean by "different watermark delay
> definitions for each source"?

For example, "table1" defines a watermark with delay 5 seconds,
"table2" defines a watermark with delay 10 seconds. They have different
watermark delay definitions. So it is also reasonable that they have different
watermark lag definitions, e.g., "table1" allows "10mins" and "table2"
allows "20mins".

> I think there is probably a misunderstanding here. FLIP-309 does NOT
directly
> specify when backlog is false. It is intentionally specified in such a way
> that there will not be any conflict between these rules.

Do you mean FLIP-309 doesn't allow specifying backlog to be false?
Is this mentioned in FLIP-309? This is completely different from what I
understand. From the API interface "ctx.setIsProcessingBacklog(boolean)",
it allows users to invoke "setIsProcessingBacklog(false)". And FLIP-309
also says "MySQL CDC source should report isProcessingBacklog=false
at the beginning of the changelog stage." If not, maybe we need to revisit
FLIP-309.

Best,
Jark



On Fri, 15 Sept 2023 at 08:41, Dong Lin  wrote:

> Hi Jark,
>
> Do you have any follow-up comment?
>
> My gut feeling is that, if we need to support per-source watermark lag
> specification in the future (not sure we have a use-case for this right
> now), we can add such a config in the future with a follow-up FLIP. The
> job-level config will still be useful as it makes users' configuration
> simpler for common scenarios.
>
> If it is OK, can we agree to make incremental progress for Flink and start
> a voting thread for this FLIP?
>
> Thanks,
> Dong
>
>
> On Mon, Sep 11, 2023 at 4:41 PM Jark Wu  wrote:
>
> > Hi Dong,
> >
> > Please see my comments inline.
> >
> > >  As a result, the proposed job-level
> > > config will be applied only in the changelog stage. So there is no
> > > difference between these two approaches in this particular case, right?
> >
> > How can the job-level config be applied ONLY in the changelog stage?
> > I think it is only possible if it is implemented by the CDC source
> itself,
> > because the framework doesn't know which stage the source is in.
> > Note that the CDC source may emit watermarks with a very small lag
> > in the snapshot stage, and the job-level config may turn the backlog
> > status into false.
> >
> > > On the other hand, per-source config will be necessary if users want to
> > > apply different watermark lag thresholds for different sources in the
> > same
> > > job.
> >
> > We also have different watermark delay definitions for each source,
> > I think it's also reasonable and necessary to have different watermark
> > lags.
> >
> >
> > > Each source can have its own rule that specifies when the backlog can
> be
> > true
> > > (e.g. MySql CDC says the backlog should be true during the snapshot
> > stage).
> > > And we can have a job-level config that specifies when the backlog
> should
> > > be true. Note that it is designed in such a way that none of these
> rules
> > > specify when the backlog should be false. That is why there is no
> > conflict
> > > by definition.
> >
> > IIUC, FLIP-309 provides `setIsProcessingBacklog` to specify when the
> > backlog
> > is true and when it is FALSE. This conflicts with the job-level config as it
> > will turn
> > the status into true.
> >
> > > If I understand your comments correctly, you mean that we might have a
> > > Flink SQL DDL with user-defined watermark expressions. And users also
> > want
> > > to set the backlog to true if the watermark generated by that
> > > user-specified expression exceeds a threshold.
> >
> > No. I mean the source may not support generating watermarks, so the
> > watermark
> > expression is applied in a following operator (instead of in the source
> > operator).
> > This will result in the watermark lag not working in this case and
> confusing
> > users.
> >
> > > You are right that this is a limitation. However, this is only a
> > short-term
> > > limitation which we added to make sure that we can focus on the
> > capability
> > > to switch from backlog=true to backlog=false. In the future, we will
> > remove
> > > this limitation and also support switching from backlog=false to
> > > backlog=true.
> >
> > I can understand it may be difficult 

Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-11 Thread Jark Wu
Hi Dong,

Please see my comments inline.

>  As a result, the proposed job-level
> config will be applied only in the changelog stage. So there is no
> difference between these two approaches in this particular case, right?

How can the job-level config be applied ONLY in the changelog stage?
I think it is only possible if it is implemented by the CDC source itself,
because the framework doesn't know which stage the source is in.
Note that the CDC source may emit watermarks with a very small lag
in the snapshot stage, and the job-level config may turn the backlog
status into false.

> On the other hand, per-source config will be necessary if users want to
> apply different watermark lag thresholds for different sources in the same
> job.

We also have different watermark delay definitions for each source,
I think it's also reasonable and necessary to have different watermark
lags.


> Each source can have its own rule that specifies when the backlog can be
true
> (e.g. MySql CDC says the backlog should be true during the snapshot
stage).
> And we can have a job-level config that specifies when the backlog should
> be true. Note that it is designed in such a way that none of these rules
> specify when the backlog should be false. That is why there is no conflict
> by definition.

IIUC, FLIP-309 provides `setIsProcessingBacklog` to specify when the backlog
is true and when it is FALSE. This conflicts with the job-level config as it
will turn
the status into true.
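
As a rough sketch of that FLIP-309 contract (the enumerator class and the
stage callbacks here are hypothetical; only the setIsProcessingBacklog call
comes from FLIP-309):

```java
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;

// Hypothetical CDC-style enumerator: only the source itself knows which
// stage it is in, so only it can report the per-stage backlog status.
class CdcEnumeratorSketch<SplitT extends SourceSplit> {
    private final SplitEnumeratorContext<SplitT> context;

    CdcEnumeratorSketch(SplitEnumeratorContext<SplitT> context) {
        this.context = context;
    }

    void onSnapshotStageStarted() {
        context.setIsProcessingBacklog(true);  // snapshot stage: backlog
    }

    void onChangelogStageStarted() {
        context.setIsProcessingBacklog(false); // changelog stage: no backlog
    }
}
```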

> If I understand your comments correctly, you mean that we might have a
> Flink SQL DDL with user-defined watermark expressions. And users also want
> to set the backlog to true if the watermark generated by that
> user-specified expression exceeds a threshold.

No. I mean the source may not support generating watermarks, so the
watermark
expression is applied in a following operator (instead of in the source
operator).
This will result in the watermark lag not working in this case and confusing
users.
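
A minimal sketch of that case (the Event type is an assumption): the source
declares no watermarks, and the watermark strategy is applied in a downstream
operator, so a source-level watermark lag would never advance:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class DownstreamWatermarkSketch {
    public static class Event {
        public long timestamp;
    }

    // "raw" is assumed to come from a source declared with
    // WatermarkStrategy.noWatermarks(), i.e. no source-side watermarks.
    static DataStream<Event> assignDownstream(DataStream<Event> raw) {
        return raw.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner((event, ts) -> event.timestamp));
    }
}
```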

> You are right that this is a limitation. However, this is only a
short-term
> limitation which we added to make sure that we can focus on the capability
> to switch from backlog=true to backlog=false. In the future, we will
remove
> this limitation and also support switching from backlog=false to
> backlog=true.

I can understand it may be difficult to support runtime mode switching back
and forth.
However, I think this should be a limitation of FLIP-327, not FLIP-328.
IIUC,
FLIP-309 doesn't have this limitation, right? I just don't understand
what the
challenge is in switching a flag.

Best,
Jark


On Sun, 10 Sept 2023 at 19:44, Dong Lin  wrote:

> Hi Jark,
>
> Thanks for the comments. Please see my comments inline.
>
> On Sat, Sep 9, 2023 at 4:13 PM Jark Wu  wrote:
>
> > Hi Xuannan,
> >
> > I leave my comments inline.
> >
> > > In the case where a user wants to
> > > use a CDC source and also determine backlog status based on watermark
> > > lag, we still need to define the rule when that occurs
> >
> > This rule should be defined by the source itself (who knows backlog
> best),
> > not by the framework. In the case of CDC source, it reports
> isBacklog=true
> > during snapshot stage, and report isBacklog=false during changelog stage
> if
> > watermark-lag is within the threshold.
> >
>
> I am not sure I fully understand the difference between adding a job-level
> config vs. adding a per-source config.
>
> In the case of CDC, its watermark lag should be either unde-defined or
> really large in the snapshot stage. As a result, the proposed job-level
> config will be applied only in the changelog stage. So there is no
> difference between these two approaches in this particular case, right?
>
> There are two advantages of the job-level config over per-source config:
>
> 1) Configuration is simpler. For example, suppose a user has a Flink job
> that consumes records from multiple Kafka sources and wants to determine
> backlog status for these Kafka sources using the same watermark lag
> threshold, there is no need for users to repeatedly specify this threshold
> for each source.
>
> 2) There is a smaller number of public APIs overall. In particular, instead
> of repeatedly adding a setProcessingBacklogWatermarkLagThreshold() API for
> every source operator that has event-time watermark lag defined, we only
> need to add one job-level config. Less public API means better simplicity
> and maintainability in general.
>
> On the other hand, per-source config will be necessary if users want to
> apply different watermark lag thresholds for different sources in the same
> job. Personally, I find this a bit counter-intuitive for users to specify
> different watermark lag thres

Re: [DISCUSS] [FLINK-32873] Add a config to allow disabling Query hints

2023-09-09 Thread Jark Wu
I agree with Jing,

My biggest concern is this makes the boundary of adding an option very
unclear.
It's not a strong reason to add a config just because it doesn't affect
existing
users. Does this mean that in the future we might add an option to disable
each feature?

Flink already has a very long list of configurations [1][2] and this is
very scary
and not easy to use. We should try to remove unnecessary configurations
from
the list in Flink 2.0. However, from my perspective, adding this option
moves us further
away from this direction.

Best,
Jark

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/

On Sat, 9 Sept 2023 at 17:33, Jing Ge  wrote:

> Hi,
>
> Thanks for bringing this to our attention. At first glance, it looks
> reasonable to offer a new configuration to enable/disable SQL hints
> globally. However, IMHO, it is not the right time to do it now, because
> we should not only think as platform providers but also as end users (the
> number of end users is much bigger than that of platform providers):
>
> 1. Users don't need it because users have the choice to use hints or not,
> just like Jark pointed out. With this configuration, there will be a fight
> between platform providers and users which will cause more confusions and
> conflicts. And users will probably win, IMHO, because they are the end
> customers that use Flink to create business values.
> 2. SQL hints could be considered as an additional feature for users to
> control, to optimize the execution plan without touching the internal
> logic, i.e. features for advanced use cases and i.e. don't use it if you
> don't understand it.
> 3. Before the system is smart enough to take over(where we are now,
> fortunately and unfortunately :-))), there should be a way for users to do
> such tuning, even if it is a temporary phase from a
> long-term's perspective, i.e. just because it is a temporary solution, does
> not mean it is not necessary for now.
> 4. What if users write wrong hints? Well, the code review process is
> recommended. Someone who truly understands hints should double check it
> before hints are merged to the master or submitted to the production env.
> Just like a common software development process.
>
> Just my two cents.
>
> Best regards,
> Jing
>
> On Thu, Sep 7, 2023 at 10:02 PM Bonnie Arogyam Varghese
>  wrote:
>
> > Hi Liu,
> >  The default will be set to enabled which is the current behavior. The
> > option will allow users/platform providers to disable it if they want to.
> >
> > On Wed, Sep 6, 2023 at 6:39 PM liu ron  wrote:
> >
> > > Hi, Bonnie
> > >
> > > I'm with Jark on why disabling hints is needed if it won't affect
> security.
> > If
> > > users don't need to use hints, then they won't care about them and I don't
> > > think it's going to be a nuisance. On top of that, Lookup Join Hint is
> > very
> > > useful for streaming jobs, and disabling the hint would result in users
> > not
> > > being able to use it.
> > >
> > > Best,
> > > Ron
> > >
> > > Bonnie Arogyam Varghese  于2023年9月6日周三
> > > 23:52写道:
> > >
> > > > Hi Liu Ron,
> > > >  To answer your question,
> > > >Security might not be the main reason for disabling this option
> but
> > > > rather the other arguments brought forward by Timo. Let me know if you have any
> > > > further questions or concerns.
> > > >
> > > > On Tue, Sep 5, 2023 at 9:35 PM Bonnie Arogyam Varghese <
> > > > bvargh...@confluent.io> wrote:
> > > >
> > > > > It looks like it will be nice to have a config to disable hints.
> Any
> > > > other
> > > > > thoughts/concerns before we can close this discussion?
> > > > >
> > > > > On Fri, Aug 18, 2023 at 7:43 AM Timo Walther 
> > > wrote:
> > > > >
> > > > >>  > lots of the streaming SQL syntax are extensions of SQL standard
> > > > >>
> > > > >> That is true. But hints are kind of a special case because they
> are
> > > not
> > > > >> even "part of Flink SQL" that's why they are written in a comment
> > > > syntax.
> > > > >>
> > > > >> Anyway, I feel hints could be sometimes confusing for users
> because
> > > most
> > > > >> of them have no effect for streaming and long-term we could also
> set
> >

Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-09 Thread Jark Wu
,
> > >>>>>>> like pure processing-time based stream processing[1] are not
> > >>>> covered. It
> > >>>>>> is
> > >>>>>>> more or less a trade-off solution that does not support such use
> > >>>> cases
> > >>>>>> and
> > >>>>>>> appropriate documentation is required. Forcing them to explicitly
> > >>>>>> generate
> > >>>>>>> watermarks that are never needed just because of this does not
> > >> sound
> > >>>>>> like a
> > >>>>>>> proper solution.
> > >>>>>>> 3. If I am not mistaken, it only works for use cases where event
> > >>>> times
> > >>>>>> are
> > >>>>>>> very close to the processing times, because the wall clock is
> > >> used to
> > >>>>>>> calculate the watermark lag and the watermark is generated based
> > >> on
> > >>>> the
> > >>>>>>> event time.
> > >>>>>>>
> > >>>>>>> Best regards,
> > >>>>>>> Jing
> > >>>>>>>
> > >>>>>>> [1]
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> https://github.com/apache/flink/blob/2c50b4e956305426f478b726d4de4a640a16b810/flink-core/src/main/java/org/apache/flink/api/common/eventtime/WatermarkStrategy.java#L236
> > >>>>>>>
> > >>>>>>> On Wed, Aug 30, 2023 at 4:06 AM Xuannan Su <
> > >> suxuanna...@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Jing,
> > >>>>>>>>
> > >>>>>>>> Thank you for the suggestion.
> > >>>>>>>>
> > >>>>>>>> The definition of watermark lag is the same as the watermarkLag
> > >>>> metric
> > >>>>>> in
> > >>>>>>>> FLIP-33[1]. More specifically, the watermark lag calculation is
> > >>>>>> computed at
> > >>>>>>>> the time when a watermark is emitted downstream in the following
> > >>>> way:
> > >>>>>>>> watermarkLag = CurrentTime - Watermark. I have added this
> > >>>> description to
> > >>>>>>>> the FLIP.
> > >>>>>>>>
> > >>>>>>>> I hope this addresses your concern.
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Xuannan
> > >>>>>>>>
> > >>>>>>>> [1]
> > >>>>>>>>
> > >>>>>>
> > >>>>
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> On Aug 28, 2023, at 01:04, Jing Ge wrote:
> > >>>>>>>>>
> > >>>>>>>>> Hi Xuannan,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for the proposal. +1 for me.
> > >>>>>>>>>
> > >>>>>>>>> There is one tiny thing that I am not sure if I understand it
> > >>>>>> correctly.
> > >>>>>>>>> Since there will be many different WatermarkStrategies and
> > >>>> different
> > >>>>>>>>> WatermarkGenerators. Could you please update the FLIP and add
> > >> the
> > >>>>>>>>> description of how the watermark lag is calculated exactly?
> > >> E.g.
> > >>>>>>>> Watermark
> > >>>>>>>>> lag = A - B with A is the timestamp of the watermark emitted
> > >> to the
> > >>>>>>>>> downstream and B is(this is the part I am not really sure
> > >> after
> > >>>>>>>> reading
> > >>>>>>>>> the FLIP).
> > >>>>>>>>>
> > >

Re: [DISCUSS][FLINK-31788][FLINK-33015] Add back Support emitUpdateWithRetract for TableAggregateFunction

2023-09-07 Thread Jark Wu
+1 to fix it first.

I also agree to deprecate it if there are few people using it,
but this should be another discussion thread within dev+user ML.

In the future, we are planning to introduce a user-defined operator
based on the TVF functionality, which I think can fully subsume
the UDTAG, cc @Timo Walther.

Best,
Jark

On Thu, 7 Sept 2023 at 11:44, Jane Chan  wrote:

> Hi devs,
>
> Recently, we noticed an issue regarding a feature regression related to
> Table API. `org.apache.flink.table.functions.TableAggregateFunction`
> provides an API `emitUpdateWithRetract` [1] to cope with updated values,
> but it's not being called in the code generator. As a result, even if users
> override this method, it does not work as intended.
>
> This issue has been present since version 1.15 (when the old planner was
> deprecated), but surprisingly, only two users have raised concerns about it
> [2][3].
>
> So, I would like to initiate a discussion to bring it back. Of course, if
> few users use it, we can also consider deprecating it.
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/#retraction-example
> [2] https://lists.apache.org/thread/rnvw8k3636dqhdttpmf1c9colbpw9svp
> [3] https://www.mail-archive.com/user-zh@flink.apache.org/msg15230.html
>
> Best,
> Jane
>
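
For readers following along, here is a condensed sketch of the API in
question, based on the retraction example in the docs linked above ([1]); it
is simplified to a "top 1" for brevity and is not the exact docs code:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.functions.TableAggregateFunction;

public class Top1 extends TableAggregateFunction<Tuple2<Integer, Integer>, Top1.Acc> {
    public static class Acc {
        public Integer first = Integer.MIN_VALUE;    // current max
        public Integer oldFirst = Integer.MIN_VALUE; // last emitted max
    }

    @Override
    public Acc createAccumulator() {
        return new Acc();
    }

    public void accumulate(Acc acc, Integer value) {
        if (value > acc.first) {
            acc.first = value;
        }
    }

    // The regressed code path: on updates, retract the previously emitted
    // row and emit only the new one, instead of re-emitting everything.
    public void emitUpdateWithRetract(Acc acc, RetractableCollector<Tuple2<Integer, Integer>> out) {
        if (!acc.first.equals(acc.oldFirst)) {
            if (!acc.oldFirst.equals(Integer.MIN_VALUE)) {
                out.retract(Tuple2.of(acc.oldFirst, 1));
            }
            out.collect(Tuple2.of(acc.first, 1));
            acc.oldFirst = acc.first;
        }
    }
}
```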


Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-09-07 Thread Jark Wu
> > in
> > > > > >> FLIP-33[1]. More specifically, the watermark lag calculation is
> > > > > computed at
> > > > > >> the time when a watermark is emitted downstream in the following
> > > way:
> > > > > >> watermarkLag = CurrentTime - Watermark. I have added this
> > > description to
> > > > > >> the FLIP.
> > > > > >>
> > > > > >> I hope this addresses your concern.
> > > > > >>
> > > > > >> Best,
> > > > > >> Xuannan
> > > > > >>
> > > > > >> [1]
> > > > > >>
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
> > > > > >>
> > > > > >>
> > > > > >>> On Aug 28, 2023, at 01:04, Jing Ge wrote:
> > > > > >>>
> > > > > >>> Hi Xuannan,
> > > > > >>>
> > > > > >>> Thanks for the proposal. +1 for me.
> > > > > >>>
> > > > > >>> There is one tiny thing that I am not sure if I understand it
> > > > > correctly.
> > > > > >>> Since there will be many different WatermarkStrategies and
> > > different
> > > > > >>> WatermarkGenerators. Could you please update the FLIP and add
> the
> > > > > >>> description of how the watermark lag is calculated exactly?
> E.g.
> > > > > >> Watermark
> > > > > >>> lag = A - B with A is the timestamp of the watermark emitted
> to the
> > > > > >>> downstream and B is(this is the part I am not really sure
> after
> > > > > >> reading
> > > > > >>> the FLIP).
> > > > > >>>
> > > > > >>> Best regards,
> > > > > >>> Jing
> > > > > >>>
> > > > > >>>
> > > > > >>> On Mon, Aug 21, 2023 at 9:03 AM Xuannan Su <
> suxuanna...@gmail.com>
> > > > > >> wrote:
> > > > > >>>
> > > > > >>>> Hi Jark,
> > > > > >>>>
> > > > > >>>> Thanks for the comments.
> > > > > >>>>
> > > > > >>>> I agree that the current solution cannot support jobs that
> cannot
> > > > > define
> > > > > >>>> watermarks. However, after considering the
> pending-record-based
> > > > > >> solution, I
> > > > > >>>> believe the current solution is superior for the target use
> case
> > > as it
> > > > > >> is
> > > > > >>>> more intuitive for users. The backlog status gives users the
> > > ability
> > > > > to
> > > > > >>>> balance between throughput and latency. Making this trade-off
> > > decision
> > > > > >>>> based on the watermark lag is more intuitive from the user's
> > > > > >> perspective.
> > > > > >>>> For instance, a user can decide that if the job lags behind
> the
> > > > > current
> > > > > >>>> time by more than 1 hour, the result is not usable. In that
> case,
> > > we
> > > > > can
> > > > > >>>> optimize for throughput when the data lags behind by more
> than an
> > > > > hour.
> > > > > >>>> With the pending-record-based solution, it's challenging for
> > > users to
> > > > > >>>> determine when to optimize for throughput and when to
> prioritize
> > > > > >> latency.
> > > > > >>>>
> > > > > >>>> Regarding the limitations of the watermark-based solution:
> > > > > >>>>
> > > > > >>>> 1. The current solution can support jobs with sources that
> have
> > > event
> > > > > >>>> time. Users can always define a watermark at the source
> operator,
> > > even
> > > > > >> if
> > > > > >>>> it's not used by downstream operators, 

Re: [DISCUSS] Update Flink Roadmap

2023-09-01 Thread Jark Wu
Thank you all for helping with the roadmap documentation.
I have merged the roadmap pull request.

Cheers,
Jark

On Wed, 23 Aug 2023 at 15:26, Jing Ge  wrote:

> Thanks Jark, +1 for the OLAP :-)
>
> Best regards,
> Jing
>
> On Sun, Aug 20, 2023 at 5:04 PM Jark Wu  wrote:
>
> > Hi all,
> >
> > I have addressed another bunch of comments on the Google doc (mainly
> about
> > the OLAP roadmap).
> > And I have opened a pull request for the website:
> > https://github.com/apache/flink-web/pull/672
> >
> > Please help to review it and continue the discussion on the pull request,
> > thanks a lot!
> >
> > Best,
> > Jark
> >
> > On Tue, 15 Aug 2023 at 12:15, Xintong Song 
> wrote:
> >
> > > Thanks for driving this, Jark.
> > >
> > > The current draft looks good to me. I think it is good to open a PR
> with
> > > it. And if there are other comments, we can discuss them during the PR
> > > review.
> > >
> > > I also added a few minor comments in the draft regarding the feature
> > radar.
> > > Those can also be discussed on the PR.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Tue, Aug 15, 2023 at 11:15 AM Shammon FY  wrote:
> > >
> > > > Hi Jark,
> > > >
> > > > Sounds good and I would love to, thanks! I will involve you and
> > Xintong
> > > > on the document after updating.
> > > >
> > > > Best,
> > > > Shammon FY
> > > >
> > > >
> > > > On Mon, Aug 14, 2023 at 10:39 PM Jark Wu  wrote:
> > > >
> > > >> Hi Shammon,
> > > >>
> > > >> Sure, could you help to draft a subsection about this in the google
> > doc?
> > > >>
> > > >> Best,
> > > >> Jark
> > > >>
> > > >> On Aug 14, 2023 at 20:30, Shammon FY wrote:
> > > >>
> > > >> Thanks @Jark for driving the Flink Roadmap.
> > > >>
> > > >> As we discussed OLAP in the thread [1] and according to the
> > suggestions
> > > >> from @Xintong Song, could we add a subsection in `Towards Streaming
> > > >> Warehouses` or `Performance` that the short-lived query in Flink
> > Session
> > > >> Cluster is one of the future directions for Flink?
> > > >>
> > > >> Best,
> > > >> Shammon FY
> > > >>
> > > >> On Mon, Aug 14, 2023 at 8:03 PM Jark Wu  wrote:
> > > >>
> > > >>> Thank you everyone for helping polish the roadmap [1].
> > > >>>
> > > >>> I think I have addressed all the comments and we have included all
> > > >>> ongoing
> > > >>> parts of Flink.
> > > >>> Please feel free to take a last look. I'm going to prepare the pull
> > > >>> request
> > > >>> if there are no more concerns.
> > > >>>
> > > >>> Best,
> > > >>> Jark
> > > >>>
> > > >>> [1]:
> > > >>>
> > > >>>
> > >
> >
> https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit
> > > >>>
> > > >>> On Sun, 13 Aug 2023 at 13:04, Yuan Mei 
> > wrote:
> > > >>>
> > > >>> > Sorry for taking so long
> > > >>> >
> > > >>> > I've added a section about Flink Disaggregated State Management
> > > >>> Evolution
> > > >>> > in the attached doc.
> > > >>> >
> > > >>> > I found some of the contents might be overlapped with the
> > > "large-scale
> > > >>> > streaming jobs". So that part might need some changes as well.
> > > >>> >
> > > >>> > Please let me know what you think.
> > > >>> >
> > > >>> > Best
> > > >>> > Yuan
> > > >>> >
> > > >>> > On Mon, Jul 24, 2023 at 12:07 PM Yuan Mei <
> yuanmei.w...@gmail.com>
> > > >>> wrote:
> > > >>> >
> > > >>> > > Sorry have missed this email and respond a bit late.
> > > >>> > >
> > > >>> > > I will put a draft for

Re: [VOTE] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-31 Thread Jark Wu
+1 (binding)

Best,
Jark

> On Aug 30, 2023 at 02:40, Venkatakrishnan Sowrirajan wrote:
> 
> Hi everyone,
> 
> Thank you all for your feedback on FLIP-356. I'd like to start a vote.
> 
> Discussion thread:
> https://lists.apache.org/thread/686bhgwrrb4xmbfzlk60szwxos4z64t7
> FLIP:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> 
> Regards
> Venkata krishnan



Re: [VOTE] FLIP-348: Make expanding behavior of virtual metadata columns configurable

2023-08-31 Thread Jark Wu
+1 (binding)

Best,
Jark

> On Aug 31, 2023 at 18:54, Jing Ge wrote:
> 
> +1(binding)
> 
> On Thu, Aug 31, 2023 at 11:22 AM Sergey Nuyanzin 
> wrote:
> 
>> +1 (binding)
>> 
>> On Thu, Aug 31, 2023 at 9:28 AM Benchao Li  wrote:
>> 
>>> +1 (binding)
>>> 
Martijn Visser wrote on Thu, Aug 31, 2023 at 15:24:
 
 +1 (binding)
 
 On Thu, Aug 31, 2023 at 9:09 AM Timo Walther 
>> wrote:
 
> Hi everyone,
> 
> I'd like to start a vote on FLIP-348: Make expanding behavior of
>>> virtual
> metadata columns configurable [1] which has been discussed in this
> thread [2].
> 
> The vote will be open for at least 72 hours unless there is an
>>> objection
> or not enough votes.
> 
> [1] https://cwiki.apache.org/confluence/x/_o6zDw
> [2] https://lists.apache.org/thread/zon967w7synby8z6m1s7dj71dhkh9ccy
> 
> Cheers,
> Timo
> 
>>> 
>>> 
>>> 
>>> --
>>> 
>>> Best,
>>> Benchao Li
>>> 
>> 
>> 
>> --
>> Best regards,
>> Sergey
>> 



Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-29 Thread Jark Wu
I'm fine with this. `ReferenceExpression` and `SupportsProjectionPushDown`
can be another FLIP. However, could you summarize the design of this part
in the future-work part of the FLIP? That would make it easier to get
started with later.


Best,
Jark


On Wed, 30 Aug 2023 at 02:45, Venkatakrishnan Sowrirajan 
wrote:

> Thanks Jark. Sounds good.
>
> One more thing, earlier in my summary I mentioned,
>
> Introduce a new *ReferenceExpression* (or *BaseReferenceExpression*)
> > abstract class which will be extended by both *FieldReferenceExpression*
> >  and *NestedFieldReferenceExpression* (to be introduced as part of this
> > FLIP)
>
> This can be punted for now and can be handled as part of refactoring
> SupportsProjectionPushDown.
>
> Also, I made minor changes to *NestedFieldReferenceExpression*: instead
> of *fieldIndexArray*, we can just keep a *fieldNames* array that
> includes the field name at every level of the nested field.
>
> Updated the FLIP-356
> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-356%3A+Support+Nested+Fields+Filter+Pushdown
> >
> wiki as well.
>
> Regards
> Venkata krishnan
>
>
> On Tue, Aug 29, 2023 at 5:21 AM Jark Wu  wrote:
>
> > Hi Venkata,
> >
> > Your summary looks good to me. +1 to start a vote.
> >
> > I think we don't need "inputIndex" in NestedFieldReferenceExpression.
> > Actually, I think it is also not needed in FieldReferenceExpression,
> > and we should try to remove it (another topic). The RexInputRef in
> Calcite
> > also doesn't require an inputIndex because the field index should
> represent
> > index of the field in the underlying row type. Field references shouldn't
> > be
> >  aware of the number of inputs.
> >
> > Best,
> > Jark
> >
> >
> > On Tue, 29 Aug 2023 at 02:24, Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu
> > >
> > wrote:
> >
> > > Hi Jinsong,
> > >
> > > Thanks for your comments.
> > >
> > > What is inputIndex in NestedFieldReferenceExpression?
> > >
> > >
> > > I haven't looked at it before. Do you mean, given that it is now only
> > used
> > > to push filters it won't be subsequently used in further
> > > planning/optimization and therefore it is not required at this time?
> > >
> > > So if NestedFieldReferenceExpression doesn't need inputIndex, is there
> > > > a need to introduce a base class `ReferenceExpression`?
> > >
> > > For SupportsFilterPushDown itself, *ReferenceExpression* base class is
> > not
> > > needed. But there were discussions around cleaning up and standardizing
> > the
> > > API for Supports*PushDown. SupportsProjectionPushDown currently pushes
> > the
> > > projects as a 2-d array, instead it would be better to use the standard
> > API
> > > which seems to be the *ResolvedExpression*. For
> > SupportsProjectionPushDown
> > > either FieldReferenceExpression (top level fields) or
> > > NestedFieldReferenceExpression (nested fields) is enough, in order to
> > > provide a single API that handles both top level and nested fields,
> > > ReferenceExpression will be introduced as a base class.
> > >
> > > Eventually, *SupportsProjectionPushDown#applyProjections* would evolve
> as
> > > applyProjection(List<ReferenceExpression> projectedFields) and nested
> > > fields would be pushed only if *supportsNestedProjections* returns
> true.
> > >
> > > Regards
> > > Venkata krishnan
> > >
> > >
> > > On Sun, Aug 27, 2023 at 11:12 PM Jingsong Li 
> > > wrote:
> > >
> > > > So if NestedFieldReferenceExpression doesn't need inputIndex, is
> there
> > > > a need to introduce a base class `ReferenceExpression`?
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Mon, Aug 28, 2023 at 2:09 PM Jingsong Li 
> > > > wrote:
> > > > >
> > > > > Hi thanks all for your discussion.
> > > > >
> > > > > What is inputIndex in NestedFieldReferenceExpression?
> > > > >
> > > > > I know inputIndex has special usage in FieldReferenceExpression,
> but
> > > > > it is only for Join operators, and it is only for SQL optimization.
> > It
> > > > > looks like there is no requirement for Nested.
> > > > >
> > > > > Best,
> > > > > Jingsong
&

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-29 Thread Jark Wu
ion. In the long
> > run, I
> > > > > > tend to abstract a basic class for NestedFieldReferenceExpression
> > and
> > > > > > FieldReferenceExpression as you suggested.
> > > > > >
> > > > > > 2. Personally, I don't recommend introducing
> > *supportsNestedFilters() in
> > > > > > supportsFilterPushdown. We just need to better declare the return
> > value
> > > > > of
> > > > > > the method *applyFilters.
> > > > > >
> > > > > > 3. Finally, I think we need to look at the costs and benefits of
> > unifying
> > > > > > the SupportsFilterPushDown and SupportsProjectionPushDown (or
> > others)
> > > > > from
> > > > > > the perspective of interface implementers. A stable API can
> reduce
> > user
> > > > > > development and change costs, if the current API can fully meet
> the
> > > > > > functional requirements at the framework level, I personally
> suggest
> > > > > reducing
> > > > > > the impact on connector developers.
> > > > > >
> > > > > > Regards,
> > > > > > Yunhong Zheng (Swuferhong)
> > > > > >
> > > > > >
> > > > > > Venkatakrishnan Sowrirajan wrote on Fri, Aug 25, 2023 at 01:25:
> > > > > >
> > > > > > > To keep it backwards compatible, introduce another API
> > *applyAggregates
> > > > > > > *with
> > > > > > > *List<ReferenceExpression>* when nested field support is added
> > and
> > > > > > > deprecate the current API. This will by default throw an
> > exception. In
> > > > > > > Flink planner, call *applyAggregates* with nested fields, and if it
> > throws
> > > > > > > an exception, then call *applyAggregates* without nested fields.
> > > > > > >
> > > > > > > Regards
> > > > > > > Venkata krishnan
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 24, 2023 at 10:13 AM Venkatakrishnan Sowrirajan <
> > > > > > > vsowr...@asu.edu> wrote:
> > > > > > >
> > > > > > > > Jark,
> > > > > > > >
> > > > > > > > How about having a separate NestedFieldReferenceExpression,
> and
> > > > > > > >> abstracting a common base class "ReferenceExpression" for
> > > > > > > >> NestedFieldReferenceExpression and FieldReferenceExpression?
> > This
> > > > > > makes
> > > > > > > >> unifying expressions in
> > > > > > > >>
> > > > >
> > "SupportsProjectionPushdown#applyProjections(List
> > > > > > > >> ...)"
> > > > > > > >> possible.
> > > > > > > >
> > > > > > > > This should be fine for *SupportsProjectionPushDown* and
> > > > > > > > *SupportsFilterPushDown*. One concern in the case of
> > > > > > > > *SupportsAggregatePushDown* with nested fields support (to be
> > added
> > > > > in
> > > > > > > > the future), with this proposal, the API will become
> backwards
> > > > > > > incompatible
> > > > > > > > as the *args* for the aggregate function is
> > > > > > > *List<FieldReferenceExpression>*
> > > > > > > > that needs to change to *List<ReferenceExpression>*.
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Venkata krishnan
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 24, 2023 at 1:18 AM Jark Wu 
> > wrote:
> > > > > > > >
> > > > > > > >> Hi Becket,
> > > > > > > >>
> > > > > > > >> I think it is the second case, that a
> > FieldReferenceExpression is
> > > > > > > >> constructed
> > > > > > > >> by the framework and passed to the connector (interfaces
> > listed by
> > > > > > > >> Venkata[1]
> > > > > > > >> and Catalog#listPartitionsByFilter). Besides, understanding
> > the
> > > > > n

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-24 Thread Jark Wu
 constructed by the framework and passed to
> user
> > or connector / plugin developers.
> >
> > For the first case, both of the approaches provide the same migration
> > experience.
> >
> > For the second case, generally speaking, introducing
> > NestedFieldReferenceExpression and extending FieldReferenceExpression
> would
> > have the same impact for backwards compatibility. SupportsFilterPushDown
> is
> > a special case here because understanding the filter expressions is
> > optional for the source implementation. In other use cases, if
> > understanding the reference to a nested field is a must have, the user
> code
> > has to be changed, regardless of which approach we take to support nested
> > fields.
> >
> > Therefore, I think we have to check each public API where the nested
> field
> > reference is exposed. If we have many public APIs where understanding
> > nested fields is optional for the user  / plugin / connector developers,
> > having a separate NestedFieldReferenceExpression would have a more smooth
> > migration. Otherwise, there seems to be no difference between the two
> > approaches.
> >
> > Migration path aside, the main reason I prefer extending
> > FieldReferenceExpression over a new NestedFieldReferenceExpression is
> > because this makes the SupportsProjectionPushDown interface simpler.
> > Otherwise, we have to treat it as a special case that does not match the
> > overall API style. Or we have to introduce two different
> applyProjections()
> > methods for FieldReferenceExpression / NestedFieldReferenceExpression
> > respectively. This issue further extends to implementation in addition to
> > public API. A single FieldReferenceExpression might help simplify the
> > implementation code a little bit. For example, in a recursive processing
> of
> > a row with nested rows, we may not need to switch between
> > FieldReferenceExpression and NestedFieldReferenceExpression depending on
> > whether the record being processed is a top level record or nested
> record.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Tue, Aug 22, 2023 at 11:43 PM Jark Wu  wrote:
> >
> > > Hi Becket,
> > >
> > > I totally agree we should try to have a consistent API for a final
> state.
> > > The only concern I have mentioned is the "smooth" migration path.
> > > The FieldReferenceExpression is widely used in many public APIs,
> > > not only in the SupportsFilterPushDown. Yes, we can change every
> > > method in two steps, but is it good to change the API back and forth for
> this?
> > > Personally, I'm fine with a separate NestedFieldReferenceExpression
> > class.
> > > TBH, I prefer the separated way because it makes the reference
> expression
> > > more clear and concise.
> > >
> > > Best,
> > > Jark
> > >
> > >
> > > On Tue, 22 Aug 2023 at 16:53, Becket Qin  wrote:
> > >
> > > > Thanks for the reply, Jark.
> > > >
> > > > I think it will be helpful to understand the final state we want to
> > > > eventually achieve first, then we can discuss the steps towards that
> > > final
> > > > state.
> > > >
> > > > It looks like there are two proposed end states now:
> > > >
> > > > 1. Have a separate NestedFieldReferenceExpression class; keep
> > > > SupportsFilterPushDown and SupportsProjectionPushDown the same. It is
> > > just
> > > > a one step change.
> > > >- Regarding the supportsNestedFilterPushDown() method, if our
> > contract
> > > > with the connector developer today is "The implementation should
> ignore
> > > > unrecognized expressions by putting them into the remaining filters,
> > > > instead of throwing exceptions". Then there is no need for this
> > method. I
> > > > am not sure about the current contract. We should probably make it
> > clear
> > > in
> > > > the interface Java doc.
> > > >
> > > > 2. Extend the existing FieldReferenceExpression class to support
> nested
> > > > fields; SupportsFilterPushDown only has one method of
> > > > applyFilters(List); SupportsProjectionPushDown
> only
> > > has
> > > > one method of applyProjections(List,
> > DataType).
> > > > It could just be two steps if we are not too obsessed with the exact
> > > names
> > > > of "

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-22 Thread Jark Wu
Hi Becket,

I totally agree we should try to have a consistent API for a final state.
The only concern I have mentioned is the "smooth" migration path.
The FieldReferenceExpression is widely used in many public APIs,
not only in the SupportsFilterPushDown. Yes, we can change every
method in two steps, but is it good to change the API back and forth for this?
Personally, I'm fine with a separate NestedFieldReferenceExpression class.
TBH, I prefer the separated way because it makes the reference expression
more clear and concise.
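
For illustration, a rough sketch of what the separate class could look like
(the field set is taken from this thread's discussion; this is not the final
FLIP API):

```java
import org.apache.flink.table.types.DataType;

// Sketch only: a dedicated expression for a nested reference such as a.b.c,
// kept separate from the existing top-level FieldReferenceExpression.
public final class NestedFieldReferenceExpression {
    private final String[] fieldNames;   // {"a", "b", "c"}: one name per level
    private final int[] fieldIndexArray; // the field index at each level
    private final DataType dataType;     // type of the innermost field

    public NestedFieldReferenceExpression(
            String[] fieldNames, int[] fieldIndexArray, DataType dataType) {
        this.fieldNames = fieldNames;
        this.fieldIndexArray = fieldIndexArray;
        this.dataType = dataType;
    }

    public String[] getFieldNames() {
        return fieldNames;
    }

    public int[] getFieldIndexArray() {
        return fieldIndexArray;
    }

    public DataType getDataType() {
        return dataType;
    }
}
```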

Best,
Jark


On Tue, 22 Aug 2023 at 16:53, Becket Qin  wrote:

> Thanks for the reply, Jark.
>
> I think it will be helpful to understand the final state we want to
> eventually achieve first, then we can discuss the steps towards that final
> state.
>
> It looks like there are two proposed end states now:
>
> 1. Have a separate NestedFieldReferenceExpression class; keep
> SupportsFilterPushDown and SupportsProjectionPushDown the same. It is just
> a one step change.
>- Regarding the supportsNestedFilterPushDown() method, if our contract
> with the connector developer today is "The implementation should ignore
> unrecognized expressions by putting them into the remaining filters,
> instead of throwing exceptions". Then there is no need for this method. I
> am not sure about the current contract. We should probably make it clear in
> the interface Java doc.
>
> 2. Extend the existing FieldReferenceExpression class to support nested
> fields; SupportsFilterPushDown only has one method of
> applyFilters(List<ResolvedExpression>); SupportsProjectionPushDown only has
> one method of applyProjections(List<FieldReferenceExpression>, DataType).
> It could just be two steps if we are not too obsessed with the exact names
> of "applyFilters" and "applyProjections". More specifically, it takes two
> steps to achieve this final state:
> a. introduce a new method tryApplyFilters(List<ResolvedExpression>) to
> SupportsFilterPushDown, which may have FieldReferenceExpression with nested
> fields. The default implementation throws an exception. The runtime will
> first call tryApplyFilters() with nested fields. In case of exception, it
> calls the existing applyFilters() without including the nested filters.
> Similarly, in SupportsProjectionPushDown, introduce a
> tryApplyProjections method returning a Result.
> The Result also contains the accepted and unapplicable projections. The
> default implementation also throws an exception. Deprecate all the other
> methods except tryApplyFilters() and tryApplyProjections().
> b. remove the deprecated methods in the next major version bump.
>
> Now the question is putting the migration steps aside, which end state do
> we prefer? While the first end state is acceptable for me, personally, I
> prefer the latter if we are designing from scratch. It is clean, consistent
> and intuitive. Given the size of Flink, keeping APIs in the same style over
> time is important. The migration is also not that complicated.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
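
A hedged sketch of step (a) above, reusing the real SupportsFilterPushDown
method and Result names; the tryApplyFilters method and its default behavior
are the thread's proposal, not an existing Flink API:

```java
import java.util.List;

import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

// Illustrative only: how the proposed two-step migration could look.
interface SupportsFilterPushDownV2 extends SupportsFilterPushDown {
    // Step (a): a new method that may also receive nested field references.
    // The default throws, so the planner can catch the exception and fall
    // back to the existing applyFilters() without the nested filters.
    default Result tryApplyFilters(List<ResolvedExpression> filters) {
        throw new UnsupportedOperationException("Nested filters are not supported");
    }
}
```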
> On Tue, Aug 22, 2023 at 2:23 PM Jark Wu  wrote:
>
> > Hi Venkat,
> >
> > Thanks for the proposal.
> >
> > I have some minor comments about the FLIP.
> >
> > 1. I think we don't need to
> > add SupportsFilterPushDown#supportsNestedFilters() method,
> > because connectors can skip nested filters by putting them in
> > Result#remainingFilters().
> > And this is backward-compatible because unknown expressions were added to
> > the remaining filters.
> > Planner should push predicate expressions as much as possible. If we add
> a
> > flag for each new filter,
> > the interface will be filled with lots of flags (e.g., supportsBetween,
> > supportsIN).
> >
> > 2. NestedFieldReferenceExpression#nestedFieldName should be an array of
> > field names?
> > Each string represents a field name part of the field path. Just keep
> > aligning with `nestedFieldIndexArray`.
> >
> > 3. My concern about making FieldReferenceExpression support nested fields
> > is the compatibility.
> > It is a public API and users/connectors are already using it. People
> > assumed it is a top-level column
> > reference, and applied logic on it. But that's not true now and this may
> > lead to unexpected errors.
> > Having a separate NestedFieldReferenceExpression sounds safer to me.
> Mixing
> > them in a class may
> >  confuse users what's the meaning of getFieldName() and getFieldIndex().
> >
> >
> > Regarding using NestedFieldReferenceExpression in
> > SupportsProjectionPushDown, do you
> > have any concerns @Timo Walther  ?
> >
> > Best,
> > Jark
> >

Re: [DISCUSS] FLIP-356: Support Nested Fields Filter Pushdown

2023-08-21 Thread Jark Wu
Hi Venkat,

Thanks for the proposal.

I have some minor comments about the FLIP.

1. I think we don't need to
add the SupportsFilterPushDown#supportsNestedFilters() method,
because connectors can skip nested filters by putting them in
Result#remainingFilters() (see the sketch after these comments).
And this is backward-compatible because unknown expressions were already
added to the remaining filters.
The planner should push predicate expressions down as much as possible. If we
add a flag for each new filter,
the interface will be filled with lots of flags (e.g., supportsBetween,
supportsIN).

2. NestedFieldReferenceExpression#nestedFieldName should be an array of
field names?
Each string represents one field-name part of the field path. Just keep it
aligned with `nestedFieldIndexArray`.

3. My concern about making FieldReferenceExpression support nested fields
is the compatibility.
It is a public API and users/connectors are already using it. People
assumed it is a top-level column
reference and applied logic to it. That assumption would no longer hold,
which may lead to unexpected errors.
Having a separate NestedFieldReferenceExpression sounds safer to me. Mixing
them in one class may
confuse users about the meaning of getFieldName() and getFieldIndex().
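
As mentioned in comment 1, a minimal sketch of a connector skipping nested
filters via the existing Result API (the capability check is
connector-specific and assumed here; "NestedFieldReferenceExpression", with
e.g. fieldNames = {"a", "b", "c"} for a filter on a.b.c, is the class
proposed in this FLIP):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

abstract class NestedFilterSkippingSource implements SupportsFilterPushDown {
    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        List<ResolvedExpression> accepted = new ArrayList<>();
        List<ResolvedExpression> remaining = new ArrayList<>();
        for (ResolvedExpression filter : filters) {
            if (isSupported(filter)) {
                accepted.add(filter);
            } else {
                // E.g. filters referencing nested fields: hand them back to
                // the planner instead of failing.
                remaining.add(filter);
            }
        }
        return Result.of(accepted, remaining);
    }

    // Connector-specific capability check; assumed for illustration.
    abstract boolean isSupported(ResolvedExpression filter);
}
```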


Regarding using NestedFieldReferenceExpression in
SupportsProjectionPushDown, do you
have any concerns @Timo Walther  ?

Best,
Jark



On Tue, 22 Aug 2023 at 05:55, Venkatakrishnan Sowrirajan 
wrote:

> Sounds like a great suggestion, Becket. +1. Agree with cleaning up the APIs
> and making them consistent across all the pushdown APIs.
>
> Your suggested approach seems fine to me, unless anyone else has any other
> concerns. Just have couple of clarifying questions:
>
> 1. Do you think we should standardize the APIs across all the pushdown
> supports like SupportsPartitionPushdown, SupportsDynamicFiltering etc in
> the end state?
>
> The current proposal works if we do not want to migrate
> > SupportsFilterPushdown to also use NestedFieldReferenceExpression in the
> > long term.
> >
> Did you mean *FieldReferenceExpression* instead of
> *NestedFieldReferenceExpression*?
>
> 2. Extend the FieldReferenceExpression to support nested fields.
> > - Change the index field type from int to int[].
>
> - Add a new method int[] getFieldIndexArray().
> > - Deprecate the int getFieldIndex() method, the code will be removed
> in
> > the next major version bump.
>
> I assume getFieldIndex would return fieldIndexArray[0], right?
>
> Thanks
> Venkat
>
> On Fri, Aug 18, 2023 at 4:47 PM Becket Qin  wrote:
>
> > Thanks for the proposal, Venkata.
> >
> > The current proposal works if we do not want to migrate
> > SupportsFilterPushdown to also use NestedFieldReferenceExpression in the
> > long term.
> >
> Did you mean *FieldReferenceExpression* instead of
> *NestedFieldReferenceExpression*?
>
> >
> > Otherwise, the alternative solution briefly mentioned in the rejected
> > alternatives would be the following:
> > Phase 1:
> > 1. Introduce a supportsNestedFilters() method to the
> SupportsFilterPushdown
> > interface. (same as current proposal).
> > 2. Extend the FieldReferenceExpression to support nested fields.
> > - Change the index field type from int to int[].
>
> - Add a new method int[] getFieldIndexArray().
> > - Deprecate the int getFieldIndex() method, the code will be removed
> in
> > the next major version bump.
>
>
>
> 3. In the SupportsProjectionPushDown interface
> > - add a new method applyProjection(List,
> > DataType), with default implementation invoking applyProjection(int[][],
> > DataType)
> > - deprecate the current applyProjection(int[][], DataType) method
> >
> > Phase 2 (in the next major version bump)
> > 1. remove the deprecated methods.
> >
> > Phase 3 (optional)
> > 1. deprecate and remove the supportsNestedFilters() /
> > supportsNestedProjection() methods from the SupportsFilterPushDown /
> > SupportsProjectionPushDown interfaces.
> >
> > Personally I prefer this alternative. It takes longer to finish the work,
> > but the API eventually becomes clean and consistent. But I can live with
> > the current proposal.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sat, Aug 19, 2023 at 12:09 AM Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu> wrote:
> >
> > > Gentle ping for reviews/feedback.
> > >
> > > On Tue, Aug 15, 2023, 5:37 PM Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu
> > > >
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am opening this thread to discuss FLIP-356: Support Nested Fields
> > > > Filter Pushdown. The FLIP can be found at
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-356*3A*Support*Nested*Fields*Filter*Pushdown__;JSsrKysr!!IKRxdwAv5BmarQ!clxXJwshKpn559SAkQiieqgGe0ZduXCzUKCmYLtFIbQLmrmEEgdmuEIM8ZM1M3O_uGqOploU4ailqGpukAg$
> > > >
> > > > This FLIP adds support for pushing down nested fields filters to the
> > > > underlying TableSource. In our data lake, we find a lot of datasets
> > have
> > 

Re: [DISCUSS] FLIP-328: Allow source operators to determine isProcessingBacklog based on watermark lag

2023-08-20 Thread Jark Wu
Hi Xuannan,

Thanks for opening this discussion.

This current proposal may work in the mentioned watermark cases.
However, it seems this is not a general solution for sources to determine
"isProcessingBacklog".
From my point of view, there are 3 limitations of the current proposal:
1. It doesn't cover jobs that don't have watermark/event-time defined,
for example streaming join and unbounded aggregate. We may still need to
figure out solutions for them.
2. Watermark lag cannot be trusted, because it increases without bound if no
data is generated in Kafka.
But in this case, there is no backlog at all.
3. Watermark lag hardly reflects the amount of backlog. Whether the watermark
lag is 1 day, 1 hour, or 1 second,
there is possibly only 1 pending record there, which means no backlog at
all.

Therefore, IMO, the watermark may not be the ideal metric to determine
"isProcessingBacklog".
What we need is something that reflects the number of records unprocessed
by the job.
Actually, that is the "pendingRecords" metric proposed in FLIP-33 and has
been implemented by Kafka source.
Did you consider using "pendingRecords" metric to determine
"isProcessingBacklog"?

Best,
Jark


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
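
Purely for illustration, a sketch of how a pendingRecords-based rule would
differ from a watermark-lag rule (nothing here is an existing Flink API; the
threshold is hypothetical):

```java
public class PendingRecordsRuleSketch {
    // Backlog derived from FLIP-33's "pendingRecords" metric: it reflects how
    // many records are left to process, so an idle source (0 pending) is never
    // classified as backlog, no matter how far the watermark lags behind.
    static boolean isProcessingBacklog(long pendingRecords, long threshold) {
        return pendingRecords > threshold;
    }

    public static void main(String[] args) {
        // Idle Kafka topic: huge watermark lag, but nothing pending.
        System.out.println(isProcessingBacklog(0L, 10_000L));       // false
        // Catching up after downtime: a real backlog.
        System.out.println(isProcessingBacklog(500_000L, 10_000L)); // true
    }
}
```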



On Tue, 15 Aug 2023 at 12:04, Xintong Song  wrote:

> Sounds good to me.
>
> It is true that, if we are introducing the generalized watermark, there
> will be other watermark related concepts / configurations that need to be
> updated anyway.
>
>
> Best,
>
> Xintong
>
>
>
> On Tue, Aug 15, 2023 at 11:30 AM Xuannan Su  wrote:
>
> > Hi Xingtong,
> >
> > Thank you for your suggestion.
> >
> > After considering the idea of using a general configuration key, I think
> > it may not be a good idea for the reasons below.
> >
> > While I agree that using a more general configuration key provides us
> with
> > the flexibility to switch to other approaches to calculate the lag in the
> > future, the downside is that it may cause confusion for users. We
> currently
> > have fetchEventTimeLag, emitEventTimeLag, and watermarkLag in the source,
> > and it is not clear which specific lag we are referring to. With the
> > potential introduction of the Generalized Watermark mechanism in the
> > future, if I understand correctly, a watermark won't necessarily need to
> be
> a timestamp. I am concerned that the general configuration key may not be
> enough to cover all the use cases and we will need to introduce a general
> > way to determine the backlog status regardless.
> >
> For the reasons above, I prefer introducing the configuration as is, and
> changing it later with a deprecation or migration process. What
> > do you think?
> >
> > Best,
> > Xuannan
> > On Aug 14, 2023, 14:09 +0800, Xintong Song ,
> wrote:
> > > Thanks for the explanation.
> > >
> > > I wonder if it makes sense to not expose this detail via the
> > configuration
> > > option. To be specific, I suggest not mentioning the "watermark"
> keyword
> > in
> > > the configuration key and description.
> > >
> > > - From the users' perspective, I think they only need to know that when
> > > there's a lag higher than the given threshold, Flink will consider latency of
> > > individual records as less important and prioritize throughput over it.
> > > They don't really need the details of how the lags are calculated.
> > > - For the internal implementation, I also think using watermark lags is
> > > a good idea, for the reasons you've already mentioned. However, it's
> not
> > > the only possible option. Hiding this detail from users would give us
> the
> > > flexibility to switch to other approaches if needed in future.
> > > - We are currently working on designing the ProcessFunction API
> > > (consider it as a DataStream API V2). There's an idea to introduce a
> > > Generalized Watermark mechanism, where basically the watermark can be
> > > anything that needs to travel along the data-flow with certain
> alignment
> > > strategies, and event time watermark would be one specific case of it.
> > This
> > > is still an idea and has not been discussed and agreed on by the
> > community,
> > > and we are preparing a FLIP for it. But if we are going for it, the
> > concept
> > > "watermark-lag-threshold" could be ambiguous.
> > >
> > > I do not intend to block the FLIP on this. I'd also be fine with
> > > introducing the configuration as is, and changing it later, if needed,
> > with
> > > a regular deprecation and migration process. Just making my
> suggestions.
> > >
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Mon, Aug 14, 2023 at 12:00 PM Xuannan Su 
> > wrote:
> > >
> > > > Hi Xintong,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > I have considered using the timestamp in the records to determine the
> > > > backlog status, and decided to use watermark at the end. By
> definition,
> > > > watermark is the time progress indication in the data stream. It
> > indicates
> 

Re: [DISCUSS] Update Flink Roadmap

2023-08-20 Thread Jark Wu
Hi all,

I have addressed another bunch of comments on the Google doc (mainly about
the OLAP roadmap).
And I have opened a pull request for the website:
https://github.com/apache/flink-web/pull/672

Please help to review it and continue the discussion on the pull request,
thanks a lot!

Best,
Jark

On Tue, 15 Aug 2023 at 12:15, Xintong Song  wrote:

> Thanks for driving this, Jark.
>
> The current draft looks good to me. I think it is good to open a PR with
> it. And if there are other comments, we can discuss them during the PR
> review.
>
> I also added a few minor comments in the draft regarding the feature radar.
> Those can also be discussed on the PR.
>
> Best,
>
> Xintong
>
>
>
> On Tue, Aug 15, 2023 at 11:15 AM Shammon FY  wrote:
>
> > Hi Jark,
> >
> > Sounds good and I would love to, thanks! I will involve you and Xintong
> > on the document after updating.
> >
> > Best,
> > Shammon FY
> >
> >
> > On Mon, Aug 14, 2023 at 10:39 PM Jark Wu  wrote:
> >
> >> Hi Shammon,
> >>
> >> Sure, could you help to draft a subsection about this in the google doc?
> >>
> >> Best,
> >> Jark
> >>
> >> 2023年8月14日 20:30,Shammon FY  写道:
> >>
> >> Thanks @Jark for driving the Flink Roadmap.
> >>
> >> As we discussed OLAP in the thread [1] and according to the suggestions
> >> from @Xintong Song, could we add a subsection in `Towards Streaming
> >> Warehouses` or `Performance` that the short-lived query in Flink Session
> >> Cluster is one of the future directions for Flink?
> >>
> >> Best,
> >> Shammon FY
> >>
> >> On Mon, Aug 14, 2023 at 8:03 PM Jark Wu  wrote:
> >>
> >>> Thank you everyone for helping polish the roadmap [1].
> >>>
> >>> I think I have addressed all the comments and we have included all
> >>> ongoing
> >>> parts of Flink.
> >>> Please feel free to take a last look. I'm going to prepare the pull
> >>> request
> >>> if there are no more concerns.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> [1]:
> >>>
> >>>
> https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit
> >>>
> >>> On Sun, 13 Aug 2023 at 13:04, Yuan Mei  wrote:
> >>>
> >>> > Sorry for taking so long
> >>> >
> >>> > I've added a section about Flink Disaggregated State Management
> >>> Evolution
> >>> > in the attached doc.
> >>> >
> >>> > I found some of the contents might be overlapped with the
> "large-scale
> >>> > streaming jobs". So that part might need some changes as well.
> >>> >
> >>> > Please let me know what you think.
> >>> >
> >>> > Best
> >>> > Yuan
> >>> >
> >>> > On Mon, Jul 24, 2023 at 12:07 PM Yuan Mei 
> >>> wrote:
> >>> >
> >>> > > Sorry have missed this email and respond a bit late.
> >>> > >
> >>> > > I will put a draft for the long-term vision for the state as well
> as
> >>> > > large-scale state support into the roadmap.
> >>> > >
> >>> > > Best
> >>> > > Yuan
> >>> > >
> >>> > > On Mon, Jul 17, 2023 at 10:34 AM Jark Wu  wrote:
> >>> > >
> >>> > >> Hi Jiabao,
> >>> > >>
> >>> > >> Thank you for your suggestions. I have added them to the "Going
> >>> Beyond a
> >>> > >> SQL Stream/Batch Processing Engine" and "Large-Scale State Jobs"
> >>> > sections.
> >>> > >>
> >>> > >> Best,
> >>> > >> Jark
> >>> > >>
> >>> > >> On Thu, 13 Jul 2023 at 16:06, Jiabao Sun  >>> > >> .invalid>
> >>> > >> wrote:
> >>> > >>
> >>> > >> > Thanks Jark and Martijn for driving this.
> >>> > >> >
> >>> > >> > There are two suggestions about the Table API:
> >>> > >> >
> >>> > >> > - Add the JSON type to adapt to the no sql database type.
> >>> > >> > - Remove changelog normalize operator for upsert stream.
> >

Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

2023-08-17 Thread Jark Wu
 >> Looks like both kinds of system columns can converge.
> >> We can say that any operator can introduce system (pseudo) columns.
> >>
> >> cc Eugene who is also interested in the subject.
> >>
> >> On Wed, Aug 2, 2023 at 1:03 AM Paul Lam  wrote:
> >>
> >>> Hi Timo,
> >>>
> >>> Thanks for starting the discussion! System columns are no doubt a
> >>> good boost on Flink SQL’s usability, and I see the feedbacks are
> >>> mainly concerns about the accessibility of system columns.
> >>>
> >>> I think most of the concerns could be solved by clarifying the
> >>> ownership of the system columns. Different from databases like
> >>> Oracle/BigQuery/PG who owns the data/metadata, Flink uses the
> >>> data/metadata from external systems. That means Flink could
> >>> have 2 kinds of system columns (take ROWID for example):
> >>>
> >>> 1. system columns provided by external systems via catalogs, such
> >>>  as ROWID from the original system.
> >>> 2. system columns generated by Flink, such as ROWID generated by
> >>>  Flink itself.
> >>>
> >>> IIUC, the FLIP is proposing the 1st approach: the catalog defines what
> >>> system columns to provide, and Flink treats them as normal columns
> >>> with a special naming pattern.
> >>>
> >>> On the other hand, Jark is proposing the 2nd one: the system columns
> >>> are defined and owned by Flink, and can be inferred from external
> >>> systems. Therefore, system columns should be predefined by Flink,
> >>> and optionally implemented by the catalogs.
> >>>
> >>> Personally, I’m in favor of the 2nd approach, because it makes the
> >>> system columns very accessible and more aligned across the catalogs.
> >>>
> >>> BTW, I second Alexey that systems columns should not be shown with
> >>> DESCRIBE statements.
> >>>
> >>> WDYT? Thanks!
> >>>
> >>> Best,
> >>> Paul Lam
> >>>
> >>>> 2023年7月31日 23:54,Jark Wu  写道:
> >>>>
> >>>> Hi Timo,
> >>>>
> >>>> Thanks for your proposal. I think this is a nice feature for users
> >>>> and I
> >>>> prefer option 3.
> >>>>
> >>>> I only have one concern about the concept of pseudo-column or
> >>>> system-column,
> >>>> because this is the first time we have introduced it in Flink SQL. The
> >>>> confusion is similar to the
> >>>> questions of Benchao and Sergey about the propagation of pseudo-columns.
> >>>>
> >>>> From my understanding, a pseudo-column can be obtained from an arbitrary
> >>> query,
> >>>> similar to
> >>>> ROWNUM in Oracle [1], for example:
> >>>>
> >>>> SELECT *
> >>>> FROM (SELECT * FROM employees ORDER BY employee_id)
> >>>> WHERE ROWNUM < 11;
> >>>>
> >>>> However, IIUC, the proposed "$rowtime" pseudo-column can only be obtained
> >>>> from
> >>>> the physical table
> >>>> and cannot be obtained from queries even if the query propagates the rowtime
> >>>> attribute. There was also
> >>>> a discussion about adding a pseudo-column "_proctime" [2] to make
> >>>> lookup
> >>>> join easier to use,
> >>>> which could be obtained from arbitrary queries. That "_proctime" may conflict
> >>> with
> >>>> the proposed
> >>>> pseudo-column concept.
> >>>>
> >>>> Did you consider making it a built-in pseudo-column
> >>>> "$rowtime"
> >>>> which returns the
> >>>> time attribute value (if it exists) or null (if it doesn't) for every
> >>>> table/query, and a pseudo-column
> >>>> "$proctime" which always returns the PROCTIME() value for each table/query? In
> >>>> this
> >>>> way, catalogs only need
> >>>> to provide a default rowtime attribute and users can access it in the
> same
> >>>> way. And we wouldn't need
> >>>> to introduce the contract interface of "Metadata Key Prefix
> Constraint",
> >>>> which is still a little complex
> >>>> for users and devs to understand.
> >>>

Re: [DISCUSS] [FLINK-32873] Add a config to allow disabling Query hints

2023-08-17 Thread Jark Wu
Sorry, I still don't understand why we need to disable query hints.
They don't have the security problems of the OPTIONS hint. Bonnie said they
could affect performance, but that depends on users using them explicitly.
If there is any performance problem, users can remove the hint.

If we want to disable query hints just because they are an extension to the
SQL standard,
I'm afraid we would have to introduce a bunch of configurations, because much of
the streaming SQL syntax consists of extensions of the SQL standard.
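
To make the distinction concrete, here is a minimal sketch (assuming a recent
Flink version with the Table API on the classpath; the table names are
illustrative, and `table.query-options.enabled` is only the key proposed in
this thread, not an existing option):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HintToggleSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Existing toggle: rejects OPTIONS hints such as
        // /*+ OPTIONS('scan.startup.mode'='earliest-offset') */,
        // which override connector settings per query (the security concern).
        tEnv.getConfig().set("table.dynamic-table-options.enabled", "false");

        // Proposed toggle from this thread (hypothetical, does not exist yet):
        // tEnv.getConfig().set("table.query-options.enabled", "false");

        // A query hint: BROADCAST only steers the optimizer's join strategy
        // and touches no connector settings. `orders` and `customers` are
        // assumed to be registered tables.
        tEnv.executeSql(
                "SELECT /*+ BROADCAST(customers) */ o.id, c.name "
                        + "FROM orders o JOIN customers c ON o.cid = c.id");
    }
}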

Best,
Jark

On Thu, 17 Aug 2023 at 15:43, Timo Walther  wrote:

> +1 for this proposal.
>
> Not every data team would like to enable hints, also because they are an
> extension to the SQL standard. It might also be the case that custom
> rules would be overwritten otherwise. Setting hints could also be the
> exclusive task of a DevOps team.
>
> Regards,
> Timo
>
>
> On 17.08.23 09:30, Konstantin Knauf wrote:
> > Hi Bonnie,
> >
> > this makes sense to me, in particular, given that we already have this
> > toggle for a different type of hints.
> >
> > Best,
> >
> > Konstantin
> >
> > On Wed, Aug 16, 2023 at 19:38, Bonnie Arogyam Varghese
> > wrote:
> >
> >> Hi Liu,
> >>   Options hints could be a security concern since users can override
> >> settings. However, query hints specifically could affect performance.
> >> Since we have a config to disable Options hint, I'm suggesting we also
> have
> >> a config to disable Query hints.
> >>
> >> On Wed, Aug 16, 2023 at 9:41 AM liu ron  wrote:
> >>
> >>> Hi,
> >>>
> >>> Thanks for driving this proposal.
> >>>
> >>> Can you explain why you would need to disable query hints because of
> >>> security issues? I don't really understand why query hints affect
> >>> security.
> >>>
> >>> Best,
> >>> Ron
> >>>
> >>> Bonnie Arogyam Varghese wrote on Wed, Aug 16, 2023
> >>> at 23:59:
> >>>
>  Platform providers may want to disable hints completely for security
>  reasons.
> 
>  Currently, there is a configuration to disable OPTIONS hint -
> 
> 
> >>>
> >>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/#table-dynamic-table-options-enabled
> 
>  However, there is no configuration available to disable QUERY hints -
> 
> 
> >>>
> >>
> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/sql/queries/hints/#query-hints
> 
>  The proposal is to add a new configuration:
> 
>  Name: table.query-options.enabled
>  Description: Enable or disable QUERY hints; if disabled, an
>  exception would be thrown if any QUERY hints are specified.
>  Note: The default value will be set to true.
> 
> >>>
> >>
> >
> >
>
>


Re: [DISCUSS] Update Flink Roadmap

2023-08-14 Thread Jark Wu
Hi Shammon,

Sure, could you help to draft a subsection about this in the google doc?

Best,
Jark

> On Aug 14, 2023, at 20:30, Shammon FY wrote:
> 
> Thanks @Jark for driving the Flink Roadmap. 
> 
> As we discussed OLAP in the thread [1], and according to the suggestions from
> @Xintong Song, could we add a subsection in `Towards Streaming Warehouses`
> or `Performance` stating that short-lived queries in Flink Session Clusters are one
> of the future directions for Flink?
> 
> Best,
> Shammon FY
> 
> On Mon, Aug 14, 2023 at 8:03 PM Jark Wu wrote:
>> Thank you everyone for helping polish the roadmap [1].
>> 
>> I think I have addressed all the comments and we have included all ongoing
>> parts of Flink.
>> Please feel free to take a last look. I'm going to prepare the pull request
>> if there are no more concerns.
>> 
>> Best,
>> Jark
>> 
>> [1]:
>> https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit
>> 
>> On Sun, 13 Aug 2023 at 13:04, Yuan Mei wrote:
>> 
>> > Sorry for taking so long
>> >
>> > I've added a section about Flink Disaggregated State Management Evolution
>> > in the attached doc.
>> >
>> > I found some of the contents might overlap with the "large-scale
>> > streaming jobs" section. So that part might need some changes as well.
>> >
>> > Please let me know what you think.
>> >
>> > Best
>> > Yuan
>> >
>> > On Mon, Jul 24, 2023 at 12:07 PM Yuan Mei wrote:
>> >
>> > > Sorry, I missed this email and am responding a bit late.
>> > >
>> > > I will put a draft for the long-term vision for the state as well as
>> > > large-scale state support into the roadmap.
>> > >
>> > > Best
>> > > Yuan
>> > >
>> > > On Mon, Jul 17, 2023 at 10:34 AM Jark Wu wrote:
>> > >
>> > >> Hi Jiabao,
>> > >>
>> > >> Thank you for your suggestions. I have added them to the "Going Beyond a
>> > >> SQL Stream/Batch Processing Engine" and "Large-Scale State Jobs"
>> > sections.
>> > >>
>> > >> Best,
>> > >> Jark
>> > >>
>> > >> On Thu, 13 Jul 2023 at 16:06, Jiabao Sun
>> > >> wrote:
>> > >>
>> > >> > Thanks Jark and Martijn for driving this.
>> > >> >
>> > >> > There are two suggestions about the Table API:
>> > >> >
>> > >> > - Add the JSON type to adapt to NoSQL database types.
>> > >> > - Remove changelog normalize operator for upsert stream.
>> > >> >
>> > >> >
>> > >> > Best,
>> > >> > Jiabao
>> > >> >
>> > >> >
>> > >> > > On Jul 13, 2023, at 3:49 PM, Jark Wu wrote:
>> > >> > >
>> > >> > > Hi all,
>> > >> > >
>> > >> > > Sorry for taking so long to get back here.
>> > >> > >
>> > >> > > Martijn and I have drafted the first version of the updated roadmap,
>> > >> > > including the updated feature radar reflecting the current state of
>> > >> > > different components.
>> > >> > >
>> > >> >
>> > >>
>> > https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit
>> > >> > >
>> > >> > > Feel free to leave comments in the thread or the document.
>> > >> > > We may miss mentioning something important, so your help in
>> > enriching
>> > >> > > the content is greatly appreciated.
>> > >> > >
>> > >> > > Best,
>> > >> > > Jark & Martijn
>> > >> > >
>> > >> > >
>> > >> > > On Fri, 2 Jun 2023 at 00:50, Jing Ge 
>> > >> wrote:
>> > >> > >
>> > >> > >> Hi Jark,
>> > >> > >>
>> > >> > >> Fair enough. Let's do it like you suggested. Thanks!

Re: [DISCUSS] FLIP-330: Support specifying record timestamp requirement

2023-08-14 Thread Jark Wu
Hi Becket,

> I kind of think that we can
restrain the scope to just batch mode, and only for StreamRecord class.
That means only in batch mode, the timestamp in the StreamRecord will be
dropped when the config is enabled.

However, IIUC, dropping the timestamp in StreamRecord is already supported.
There is an existing optimization in StreamElementSerializer: the 8 bytes
of the timestamp are not serialized if there is no timestamp on the
StreamRecord.
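
A simplified sketch of that encoding (illustrative class and tag values, not
the actual StreamElementSerializer internals): the 8-byte timestamp is written
only when present, while the 1-byte tag is always written, and that tag byte
is exactly what FLIP-330 wants to avoid.

import java.io.DataOutputStream;
import java.io.IOException;

final class StreamElementEncodingSketch {

    private static final byte TAG_REC_WITH_TIMESTAMP = 0;
    private static final byte TAG_REC_WITHOUT_TIMESTAMP = 1;

    static void writeRecord(
            DataOutputStream out,
            byte[] serializedValue,
            boolean hasTimestamp,
            long timestamp)
            throws IOException {
        if (hasTimestamp) {
            out.writeByte(TAG_REC_WITH_TIMESTAMP); // 1 byte
            out.writeLong(timestamp); // 8 bytes, skipped when absent
        } else {
            out.writeByte(TAG_REC_WITHOUT_TIMESTAMP); // still 1 byte per record
        }
        out.write(serializedValue); // N-byte payload
    }
}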

-

Reducing the 1-byte StreamElement tag is a good idea to improve performance.
But I agree with Xintong and Gyula that we should strike a balance between
complexity and performance. I'm fine with introducing this optimization only
for
pure batch SQL, because that is the only mode (not even batch DataStream
or the batch Table API) where it can be enabled by default. But I have
concerns about other options.

My largest concern is that it exposes a configuration to users
which
is hard to understand, which users will be afraid to enable, and which may
not be worth enabling. If
users
rarely enable this configuration, it becomes an overhead for
the community to maintain, without benefits.

Besides, I doubt whether we can remove "pipeline.force-timestamp-support"
in the future. From my understanding, it is pretty hard for the framework
to detect
that a job has no watermark strategy, because watermarks
may be emitted from any operator via Output#emitWatermark.

Best,
Jark


On Sat, 12 Aug 2023 at 13:23, Gyula Fóra  wrote:

> Hey Devs,
>
> It would be great to see some other benchmarks, not only the dummy
> WordCount example.
>
> I would love to see a few SQL queries documented and whether there is any
> measurable benefit at all.
>
> Prod pipelines usually have some IO component etc which will add enough
> overhead to make this even less noticeable. I agree that even small
> improvements are worthwhile but they should be observable/significant on
> real workloads. Otherwise, complicating the runtime layer, types, and configs
> is not worth it in my opinion.
>
> Cheers
> Gyula
>
> On Sat, 12 Aug 2023 at 04:39, Becket Qin  wrote:
>
> > Thanks for the FLIP, Yunfeng.
> >
> > I had a brief offline discussion with Dong, and here are my two cents:
> >
> > ## The benefit
> > The FLIP is related to one of the perf benchmarks we saw at LinkedIn
> which
> > is pretty much doing a word count, except that the words are country
> codes,
> > so it is typically just two bytes, e.g. CN, US, UK. What I see is that
> the
> > amount of data going through shuffle is much higher in Flink DataStream
> > batch mode compared with the Flink DataSet API. And in this case, because
> > the actual key is just 2 bytes so the overhead is kind of high. In batch
> > processing, it is not rare that people first tokenize the data before
> > processing to save cost. For example, imagine in word count the words are
> > coded as 4-byte Integers instead of Strings. So the 1-byte overhead can
> > still amount to 25 percent on top of the payload. Therefore, I think the
> > optimization in the FLIP can still benefit a bunch of batch processing
> > cases. For streaming, the benefit still applies, although less compared
> > with batch.
> >
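
(Spelling out the arithmetic of the paragraph above: with a 4-byte integer
key, a serialized record is roughly 1 byte of tag plus 4 bytes of payload, so
the tag adds 1/4 = 25% on top of the payload; for 2-byte country codes it is
1/2 = 50%. For a 100 B or 1 KB row the same byte adds only about 1% or 0.1%,
which is why the benefit concentrates in small-record shuffles.)
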
> > ## The complexity and long term solution
> > In terms of the complexity of the FLIP. I kind of think that we can
> > restrain the scope to just batch mode, and only for StreamRecord class.
> > That means only in batch mode, the timestamp in the StreamRecord will be
> > dropped when the config is enabled. This should give the most of the
> > benefit while significantly reducing the complexity of the FLIP.
> > In practice, I think people rarely use StreamRecord timestamps in batch
> > jobs. But because this is not an explicit API contract for users, from
> what
> > I understand, the configuration is introduced to make it 100% safe for
> the
> > users. In another word, we won't need this configuration if our contract
> > with users does not support timestamps in batch mode. In order to make
> the
> > contract clear, maybe we can print a warning if the timestamp field in
> > StreamRecord is accessed in batch mode starting from the next release. So
> > we can drop the configuration completely in 2.0. By then, Flink should
> have
> > enough information to determine whether timestamps in StreamRecords
> should
> > be supported for a job/operator or not, e.g. batch mode, processing time
> > only jobs, etc.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Fri, Aug 11, 2023 at 9:46 PM Dong Lin  wrote:
> >
> > > Hi Xintong,
> > >
> > > Thanks for the quick reply. I also agree that we should hear from
> others
> > > about
> > > whether this optimization is worthwhile.
> > >
> > > Please see my comments inline.
> > >
> > > On Fri, Aug 11, 2023 at 5:54 PM Xintong Song 
> > > wrote:
> > >
> > > > Thanks for the quick replies.
> > > >
> > > > Overall, it seems that the main concern with this FLIP is that the 2%
> > > > > throughput saving might not

Re: [DISCUSS] Update Flink Roadmap

2023-08-14 Thread Jark Wu
Thank you everyone for helping polish the roadmap [1].

I think I have addressed all the comments and we have included all ongoing
parts of Flink.
Please feel free to take a last look. I'm going to prepare the pull request
if there are no more concerns.

Best,
Jark

[1]:
https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit

On Sun, 13 Aug 2023 at 13:04, Yuan Mei  wrote:

> Sorry for taking so long
>
> I've added a section about Flink Disaggregated State Management Evolution
> in the attached doc.
>
> I found some of the contents might overlap with the "large-scale
> streaming jobs" section. So that part might need some changes as well.
>
> Please let me know what you think.
>
> Best
> Yuan
>
> On Mon, Jul 24, 2023 at 12:07 PM Yuan Mei  wrote:
>
> > Sorry, I missed this email and am responding a bit late.
> >
> > I will put a draft for the long-term vision for the state as well as
> > large-scale state support into the roadmap.
> >
> > Best
> > Yuan
> >
> > On Mon, Jul 17, 2023 at 10:34 AM Jark Wu  wrote:
> >
> >> Hi Jiabao,
> >>
> >> Thank you for your suggestions. I have added them to the "Going Beyond a
> >> SQL Stream/Batch Processing Engine" and "Large-Scale State Jobs"
> sections.
> >>
> >> Best,
> >> Jark
> >>
> >> On Thu, 13 Jul 2023 at 16:06, Jiabao Sun  >> .invalid>
> >> wrote:
> >>
> >> > Thanks Jark and Martijn for driving this.
> >> >
> >> > There are two suggestions about the Table API:
> >> >
> >> > - Add the JSON type to adapt to NoSQL database types.
> >> > - Remove changelog normalize operator for upsert stream.
> >> >
> >> >
> >> > Best,
> >> > Jiabao
> >> >
> >> >
> >> > > On Jul 13, 2023, at 3:49 PM, Jark Wu wrote:
> >> > >
> >> > > Hi all,
> >> > >
> >> > > Sorry for taking so long to get back here.
> >> > >
> >> > > Martijn and I have drafted the first version of the updated roadmap,
> >> > > including the updated feature radar reflecting the current state of
> >> > > different components.
> >> > >
> >> >
> >>
> https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit
> >> > >
> >> > > Feel free to leave comments in the thread or the document.
> >> > > We may miss mentioning something important, so your help in
> enriching
> >> > > the content is greatly appreciated.
> >> > >
> >> > > Best,
> >> > > Jark & Martijn
> >> > >
> >> > >
> >> > > On Fri, 2 Jun 2023 at 00:50, Jing Ge 
> >> wrote:
> >> > >
> >> > >> Hi Jark,
> >> > >>
> >> > >> Fair enough. Let's do it like you suggested. Thanks!
> >> > >>
> >> > >> Best regards,
> >> > >> Jing
> >> > >>
> >> > >> On Thu, Jun 1, 2023 at 6:00 PM Jark Wu  wrote:
> >> > >>
> >> > >>> Hi Jing,
> >> > >>>
> >> > >>> This thread is for discussing the roadmap for versions 1.18, 2.0,
> >> and
> >> > >> even
> >> > >>> more.
> >> > >>> One of the outcomes of this discussion will be an updated version
> of
> >> > the
> >> > >>> current roadmap.
> >> > >>> Let's work together on refining the roadmap in this thread.
> >> > >>>
> >> > >>> Best,
> >> > >>> Jark
> >> > >>>
> >> > >>> On Thu, 1 Jun 2023 at 23:25, Jing Ge 
> >> > wrote:
> >> > >>>
> >> > >>>> Hi Jark,
> >> > >>>>
> >> > >>>> Thanks for driving it! For point 2, since we are developing 1.18
> >> now,
> >> > >>>> does it make sense to update the roadmap this time while we are
> >> > >> releasing
> >> > >>>> 1.18? This discussion thread will be focusing on the Flink 2.0
> >> > roadmap,
> >> > >>> as
> >> > >>>> you mentioned previously. WDYT?
> >> > >>>>
> >> > >>>> Best regards,

Re: [ANNOUNCE] New Apache Flink Committer - Weihua Hu

2023-08-04 Thread Jark Wu
Congratulations, Weihua!

Best,
Jark

On Fri, 4 Aug 2023 at 14:48, Yuxin Tan  wrote:

> Congratulations Weihua!
>
> Best,
> Yuxin
>
>
> > Junrui Lee wrote on Fri, Aug 4, 2023 at 14:28:
>
> > Congrats, Weihua!
> > Best,
> > Junrui
> >
> > > Geng Biao wrote on Fri, Aug 4, 2023 at 14:25:
> >
> > > Congrats, Weihua!
> > > Best,
> > > Biao Geng
> > >
> > > Sent from Outlook for iOS
> > > 
> > > From: 周仁祥 
> > > Sent: Friday, August 4, 2023 2:23:42 PM
> > > To: dev@flink.apache.org 
> > > Cc: Weihua Hu 
> > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Weihua Hu
> > >
> > > Congratulations, Weihua~
> > >
> > > > On Aug 4, 2023, at 14:21, Sergey Nuyanzin wrote:
> > > >
> > > > Congratulations, Weihua!
> > > >
> > > > On Fri, Aug 4, 2023 at 8:03 AM Chen Zhanghao <
> > zhanghao.c...@outlook.com>
> > > > wrote:
> > > >
> > > >> Congratulations, Weihua!
> > > >>
> > > >> Best,
> > > >> Zhanghao Chen
> > > >> 
> > > >> From: Xintong Song 
> > > >> Sent: Aug 4, 2023 at 11:18
> > > >> To: dev 
> > > >> Cc: Weihua Hu 
> > > >> Subject: [ANNOUNCE] New Apache Flink Committer - Weihua Hu
> > > >>
> > > >> Hi everyone,
> > > >>
> > > >> On behalf of the PMC, I'm very happy to announce Weihua Hu as a new
> > > Flink
> > > >> Committer!
> > > >>
> > > >> Weihua has been consistently contributing to the project since May
> > > 2022. He
> > > >> mainly works in Flink's distributed coordination areas. He is the
> main
> > > >> contributor of FLIP-298 and many other improvements in large-scale
> job
> > > >> scheduling. He is also quite active in mailing
> lists,
> > > >> participating in discussions and answering user questions.
> > > >>
> > > >> Please join me in congratulating Weihua!
> > > >>
> > > >> Best,
> > > >>
> > > >> Xintong (on behalf of the Apache Flink PMC)
> > > >>
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Sergey
> > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Matthias Pohl

2023-08-04 Thread Jark Wu
Congratulations, Matthias!

Best,
Jark

On Fri, 4 Aug 2023 at 14:59, Weihua Hu  wrote:

> Congratulations,  Matthias!
>
> Best,
> Weihua
>
>
> On Fri, Aug 4, 2023 at 2:49 PM Yuxin Tan  wrote:
>
> > Congratulations, Matthias!
> >
> > Best,
> > Yuxin
> >
> >
> > > Sergey Nuyanzin wrote on Fri, Aug 4, 2023 at 14:21:
> >
> > > Congratulations, Matthias!
> > > Well deserved!
> > >
> > > On Fri, Aug 4, 2023 at 7:59 AM liu ron  wrote:
> > >
> > > > Congrats, Matthias!
> > > >
> > > > Best,
> > > > Ron
> > > >
> > > > > Shammon FY wrote on Fri, Aug 4, 2023 at 13:24:
> > > >
> > > > > Congratulations, Matthias!
> > > > >
> > > > > On Fri, Aug 4, 2023 at 1:13 PM Samrat Deb 
> > > wrote:
> > > > >
> > > > > > Congrats, Matthias!
> > > > > >
> > > > > >
> > > > > > On Fri, 4 Aug 2023 at 10:13 AM, Benchao Li  >
> > > > wrote:
> > > > > >
> > > > > > > Congratulations, Matthias!
> > > > > > >
> > > > > > > > Jing Ge wrote on Fri, Aug 4, 2023 at 12:35:
> > > > > > >
> > > > > > > > Congrats! Matthias!
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Jing
> > > > > > > >
> > > > > > > > On Fri, Aug 4, 2023 at 12:09 PM Yangze Guo <
> karma...@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Congrats, Matthias!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yangze Guo
> > > > > > > > >
> > > > > > > > > On Fri, Aug 4, 2023 at 11:44 AM Qingsheng Ren <
> > > re...@apache.org>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Congratulations, Matthias! This is absolutely well
> > deserved.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Qingsheng
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 4, 2023 at 11:31 AM Rui Fan <
> > > 1996fan...@gmail.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Congratulations Matthias, well deserved!
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Rui Fan
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Aug 4, 2023 at 11:30 AM Leonard Xu <
> > > > xbjt...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Congratulations,  Matthias.
> > > > > > > > > > > >
> > > > > > > > > > > > Well deserved ^_^
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Leonard
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > On Aug 4, 2023, at 11:18 AM, Xintong Song <
> > > > > > > tonysong...@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On behalf of the PMC, I'm very happy to announce
> that
> > > > > > Matthias
> > > > > > > > > Pohl has
> > > > > > > > > > > > > joined the Flink PMC!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Matthias has been consistently contributing to the
> > > > project
> > > > > > > since
> > > > > > > > > Sep
> > > > > > > > > > > > 2020,
> > > > > > > > > > > > > and became a committer in Dec 2021. He mainly works
> > in
> > > > > > Flink's
> > > > > > > > > > > > distributed
> > > > > > > > > > > > > coordination and high availability areas. He has
> > worked
> > > > on
> > > > > > many
> > > > > > > > > FLIPs
> > > > > > > > > > > > > including FLIP195/270/285. He helped a lot with the
> > > > release
> > > > > > > > > management,
> > > > > > > > > > > > > being one of the Flink 1.17 release managers and
> also
> > > > very
> > > > > > > active
> > > > > > > > > in
> > > > > > > > > > > > Flink
> > > > > > > > > > > > > 1.18 / 2.0 efforts. He also contributed a lot to
> > > > improving
> > > > > > the
> > > > > > > > > build
> > > > > > > > > > > > > stability.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please join me in congratulating Matthias!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong (on behalf of the Apache Flink PMC)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Best,
> > > > > > > Benchao Li
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Sergey
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Hong Teoh

2023-08-04 Thread Jark Wu
Congratulations, Hong!

Best,
Jark

On Fri, 4 Aug 2023 at 14:24, Sergey Nuyanzin  wrote:

> Congratulations, Hong!
>
> On Fri, Aug 4, 2023 at 7:25 AM Shammon FY  wrote:
>
> > Congratulations, Hong!
> >
> > Best,
> > Shammon FY
> >
> > On Fri, Aug 4, 2023 at 12:33 PM Jing Ge 
> > wrote:
> >
> > > congrats! Hong!
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Fri, Aug 4, 2023 at 11:48 AM Qingsheng Ren 
> > wrote:
> > >
> > > > Congratulations and welcome aboard, Hong!
> > > >
> > > > Best,
> > > > Qingsheng
> > > >
> > > > On Fri, Aug 4, 2023 at 11:04 AM Matt Wang  wrote:
> > > >
> > > > > Congratulations, Hong!
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Best,
> > > > > Matt Wang
> > > > >
> > > > >
> > > > >  Replied Message 
> > > > > | From | Weihua Hu |
> > > > > | Date | 08/4/2023 10:55 |
> > > > > | To |  |
> > > > > | Subject | Re: [ANNOUNCE] New Apache Flink Committer - Hong Teoh |
> > > > > Congratulations, Hong!
> > > > >
> > > > > Best,
> > > > > Weihua
> > > > >
> > > > >
> > > > > On Fri, Aug 4, 2023 at 10:49 AM Samrat Deb 
> > > > wrote:
> > > > >
> > > > > Congratulations, Hong Teoh
> > > > >
> > > > > On Fri, 4 Aug 2023 at 7:52 AM, Benchao Li 
> > > wrote:
> > > > >
> > > > > Congratulations, Hong!
> > > > >
> > > > > yuxia wrote on Fri, Aug 4, 2023 at 09:23:
> > > > >
> > > > > Congratulations, Hong Teoh!
> > > > >
> > > > > Best regards,
> > > > > Yuxia
> > > > >
> > > > > - Original Message -
> > > > > From: "Matthias Pohl" 
> > > > > To: "dev" 
> > > > > Sent: Thursday, Aug 3, 2023 at 11:24:39 PM
> > > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Hong Teoh
> > > > >
> > > > > Congratulations, Hong! :)
> > > > >
> > > > > On Thu, Aug 3, 2023 at 3:39 PM Leonard Xu 
> wrote:
> > > > >
> > > > > Congratulations, Hong!
> > > > >
> > > > >
> > > > > Best,
> > > > > Leonard
> > > > >
> > > > > On Aug 3, 2023, at 8:45 PM, Jiabao Sun
> > > > > wrote:
> > > > >
> > > > > Congratulations, Hong Teoh!
> > > > >
> > > > > Best,
> > > > > Jiabao Sun
> > > > >
> > > > > On Aug 3, 2023, at 7:32 PM, Danny Cranmer wrote:
> > > > >
> > > > > On behalf of the PMC, I'm very happy to announce Hong Teoh as a
> > > > > new
> > > > > Flink
> > > > > Committer.
> > > > >
> > > > > Hong has been active in the Flink community for over 1 year and
> > > > > has
> > > > > played
> > > > > a key role in developing and maintaining AWS integrations, core
> > > > > connector
> > > > > APIs and more recently, improvements to the Flink REST API.
> > > > > Additionally,
> > > > > Hong is a very active community member, supporting users and
> > > > > participating
> > > > > in discussions on the mailing lists, Flink slack channels and
> > > > > speaking
> > > > > at
> > > > > conferences.
> > > > >
> > > > > Please join me in congratulating Hong for becoming a Flink
> > > > > Committer!
> > > > >
> > > > > Thanks,
> > > > > Danny Cranmer (on behalf of the Flink PMC)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Best,
> > > > > Benchao Li
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Best regards,
> Sergey
>


Re: FLINK-20767 - Support for nested fields filter push down

2023-08-02 Thread Jark Wu
Hi,

I agree with Becket that we may need to extend FieldReferenceExpression to
support nested field access (or maybe a new
NestedFieldReferenceExpression).
But I have some concerns about evolving the
SupportsProjectionPushDown.applyProjection.
A projection is much simpler than a filter expression: it only needs to
represent the field indexes.
If we evolve `applyProjection` to accept `List<FieldReferenceExpression>
projectedFields`,
users have to convert the `List<FieldReferenceExpression>` back to int[][],
which is an overhead for users.
Field indexes (int[][]) are required to project schemas with the
utility org.apache.flink.table.connector.Projection.
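
To illustrate that conversion overhead, here is a sketch. The
NestedFieldReferenceExpression class below is hypothetical (no such class
exists at this point); the int[][] shape follows the existing
SupportsProjectionPushDown convention, where e.g. {{0}, {2, 1}} means
"top-level field 0, plus sub-field 1 of field 2":

import java.util.List;

// Hypothetical expression carrying a nested access path, e.g. [2, 1]
// for "sub-field 1 of top-level field 2". Not an existing Flink class.
final class NestedFieldReferenceExpression {

    private final int[] fieldPath;

    NestedFieldReferenceExpression(int[] fieldPath) {
        this.fieldPath = fieldPath;
    }

    int[] getFieldPath() {
        return fieldPath;
    }
}

final class ProjectionConversionSketch {

    // The boilerplate every connector would repeat if applyProjection took
    // expressions: turning the list back into the int[][] index paths that
    // the schema-projection utilities expect.
    static int[][] toNestedIndexPaths(List<NestedFieldReferenceExpression> fields) {
        return fields.stream()
                .map(NestedFieldReferenceExpression::getFieldPath)
                .toArray(int[][]::new);
    }
}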


Best,
Jark



On Wed, 2 Aug 2023 at 07:40, Venkatakrishnan Sowrirajan 
wrote:

> Thanks Becket for the suggestion. That makes sense. Let me try it out and
> get back to you.
>
> Regards
> Venkata krishnan
>
>
> On Tue, Aug 1, 2023 at 9:04 AM Becket Qin  wrote:
>
> > This is a very useful feature in practice.
> >
> > It looks to me that the key issue here is that Flink ResolvedExpression
> > does not have the necessary abstraction for nested field access. So the
> Calcite
> > RexFieldAccess does not have a counterpart in the ResolvedExpression. The
> > FieldReferenceExpression only supports direct access to the fields, not
> > nested access.
> >
> > Theoretically speaking, this nested field reference is also required by
> > projection pushdown. However, we addressed that by using an int[][] in
> the
> > SupportsProjectionPushDown interface. Maybe we can do the following:
> >
> > 1. Extend the FieldReferenceExpression to include an int[] for nested
> field
> > access,
> > 2. By doing (1),
> > SupportsFilterPushDown#applyFilters(List<ResolvedExpression>) can support
> > nested field access.
> > 3. Evolve the SupportsProjectionPushDown.applyProjection(int[][]
> > projectedFields, DataType producedDataType) to
> > applyProjection(List<FieldReferenceExpression> projectedFields, DataType
> > producedDataType)
> >
> > This will need a FLIP.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Aug 1, 2023 at 11:42 PM Venkatakrishnan Sowrirajan <
> > vsowr...@asu.edu>
> > wrote:
> >
> > > Thanks for the response. Looking forward to your pointers. In the
> > > meanwhile, let me figure out how we can implement it. Will keep you
> > posted.
> > >
> > > On Mon, Jul 31, 2023, 11:43 PM liu ron  wrote:
> > >
> > > > Hi, Venkata
> > > >
> > > > Thanks for reporting this issue. Currently, Flink doesn't support
> > nested
> > > > filter pushdown. I also think that this optimization would be useful,
> > > > especially for jobs that need to read a lot of data from
> > > Parquet
> > > > or ORC files. We didn't move forward with this for some priority
> > reasons.
> > > >
> > > > Regarding your three questions, I will respond to you later after my
> > > > on-call is finished because I need to dive into the source code.
> About
> > > your
> > > > commit, I don't think it's the right solution because
> > > > FieldReferenceExpression doesn't currently support nested field
> filter
> > > > pushdown; maybe we need to extend it.
> > > >
> > > > You can also look further into reasonable solutions, which we'll
> > discuss
> > > > further later on.
> > > >
> > > > Best,
> > > > Ron
> > > >
> > > >
> > > > Venkatakrishnan Sowrirajan  于2023年7月29日周六 03:31写道:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Currently, I am working on adding support for nested fields filter
> > push
> > > > > down. In our use case running Flink on Batch, we found nested
> fields
> > > > filter
> > > > > push down is key - without it, it is significantly slow. Note:
> Spark
> > > SQL
> > > > > supports nested fields filter push down.
> > > > >
> > > > > While debugging the code using IcebergTableSource as the table
> > source,
> > > > > narrowed down the issue to missing support for
> > > > > RexNodeExtractor#RexNodeToExpressionConverter#visitFieldAccess.
> > > > > As part of fixing it, I made changes by returning an
> > > > > Option(FieldReferenceExpression)
> > > > > with appropriate reference to the parent index and the child index
> > for
> > > > the
> > > > > nested field with the data type info.
> > > > >
> > > > > But this new ResolvedExpression cannot be converted to RexNode
> which
> > > > > happens in PushFilterIntoSourceScanRuleBase
> > > > > <
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://github.com/apache/flink/blob/3f63e03e83144e9857834f8db1895637d2aa218a/flink-table/flink-table-planner/src/main/java/org/apache/flink/table/planner/plan/rules/logical/PushFilterIntoSourceScanRuleBase.java*L104__;Iw!!IKRxdwAv5BmarQ!fNgxcul8ZGwkNE9ygOeVGlWlU6m_MLMXf4A3S3oQu9LBzYTPF90pZ7uXSGMr-5dFmzRn37-e9Q5cMnVs$
> > > > > >
> > > > > .
> > > > >
> > > > > Few questions
> > > > >
> > > > > 1. Does FieldReferenceExpression support nested fields currently or
> > > > should
> > > > > it be extended to support nested fields? I couldn't figure this out
> > > from
> > > > > the PushProjectIntoTableScanRule that supports nested column
> > projection
> > > > > push down.
> > > > > 2. Express

Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

2023-07-31 Thread Jark Wu
Hi Timo,

Thanks for your proposal. I think this is a nice feature for users and I
prefer option 3.

I only have one concern about the concept of pseudo-column or
system-column,
because this is the first time we have introduced it in Flink SQL. The
confusion is similar to the
questions of Benchao and Sergey about the propagation of pseudo-columns.

From my understanding, a pseudo-column can be obtained from an arbitrary query,
similar to
ROWNUM in Oracle [1], for example:

SELECT *
FROM (SELECT * FROM employees ORDER BY employee_id)
WHERE ROWNUM < 11;

However, IIUC, the proposed "$rowtime" pseudo-column can only be obtained from
the physical table
and cannot be obtained from queries even if the query propagates the rowtime
attribute. There was also
a discussion about adding a pseudo-column "_proctime" [2] to make lookup
join easier to use,
which could be obtained from arbitrary queries. That "_proctime" may conflict with
the proposed
pseudo-column concept.

Did you consider making it a built-in pseudo-column "$rowtime"
which returns the
time attribute value (if it exists) or null (if it doesn't) for every
table/query, and a pseudo-column
"$proctime" which always returns the PROCTIME() value for each table/query? In this
way, catalogs only need
to provide a default rowtime attribute and users can access it in the same
way. And we wouldn't need
to introduce the contract interface of "Metadata Key Prefix Constraint",
which is still a little complex
for users and devs to understand.
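
As a sketch of what that could look like (hypothetical syntax and semantics
illustrating the idea under discussion, not a shipped feature; `clicks` is an
assumed table with a time attribute):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PseudoColumnSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // `$rowtime` would resolve to the time attribute value of the input,
        // or NULL if the input has none. (Hypothetical; would not run today.)
        tEnv.executeSql("SELECT `$rowtime`, user_id FROM clicks");

        // Under built-in semantics it would also resolve on a derived query,
        // unlike a catalog-provided system column that exists only on base
        // tables. (Also hypothetical.)
        tEnv.executeSql(
                "SELECT `$proctime` FROM "
                        + "(SELECT user_id FROM clicks WHERE user_id > 0)");
    }
}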

Best,
Jark

[1]:
https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255
[2]: https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5




On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy <
vendrov...@gmail.com> wrote:

> >
> > `SELECT * FROM (SELECT $rowtime, * FROM t);`
> > Am I right that it will show `$rowtime` in the output?
>
>
> Yes, all explicitly selected columns become a part of the result (and
> intermediate) schema, and hence propagate.
>
> On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy <
> vendrov...@gmail.com> wrote:
>
> > Thank you, Timo, for starting this FLIP!
> >
> > I propose the following change:
> >
> > Remove the requirement that DESCRIBE needs to show system columns.
> >
> >
> > Some concrete vendor-specific catalog implementations might prefer this
> > approach.
> > Usually the same system columns are available on all tables (or a family
> > of tables), and this can be easily captured in the documentation.
> >
> > For example, BigQuery does exactly this: there, pseudo-columns do not
> show
> > up in the table schema anywhere, but can be accessed via reference.
> >
> > So I propose we:
> > a) Either we say that DESCRIBE doesn't show system columns,
> > b) Or leave this vendor-specific / or configurable via flag (if needed).
> >
> > Regards,
> > Alexey
> >
> > On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin 
> > wrote:
> >
> >> Hi Timo,
> >>
> >> Thanks for the FLIP.
> >> I also tend to think that Option 3 is better.
> >>
> >> I would also be interested in the question mentioned by Benchao Li.
> >> And a similar question about nested queries like
> >> `SELECT * FROM (SELECT $rowtime, * FROM t);`
> >> Am I right that it will show `$rowtime` in the output?
> >>
> >>
> >> On Thu, Jul 27, 2023 at 6:58 AM Benchao Li 
> wrote:
> >>
> >> > Hi Timo,
> >> >
> >> > Thanks for the FLIP, I also like the idea and option 3 sounds good to
> >> me.
> >> >
> >> > I would like to discuss a case which is not mentioned in the current
> >> FLIP.
> >> > How are the "system columns" expressed in intermediate results, e.g.
> >> Join?
> >> > E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not include "system
> >> > columns" from t1 and t2 as you proposed, and for `SELECT t1.$rowtime,
> *
> >> > FROM t1 JOIN t2`, it should also be valid.
> >> > Then the question is how you plan to implement the "system
> columns":
> >> do
> >> > we need to add it at the `RelNode` level? Or do we just need to do it in the
> >> > parsing/validating phase?
> >> > I'm not sure that Calcite's "system column" feature is fully ready for
> >> this
> >> > since the code about this part was imported from the earlier project
> >> before
> >> > it got into Apache, and has not been considered much in the past
> >> > development.
> >> >
> >> >
> >> > Jing Ge  于2023年7月26日周三 00:01写道:
> >> >
> >> > > Hi Timo,
> >> > >
> >> > > Thanks for your proposal. It is a very pragmatic feature. Among all
> >> > options
> >> > > in the FLIP, option 3 is one I prefer too and I'd like to ask some
> >> > > questions to understand your thoughts.
> >> > >
> >> > > 1. I did some research on pseudo columns, just out of curiosity, do
> >> you
> >> > > know why most SQL systems do not need any prefix with their pseudo
> >> > column?
> >> > > 2. Some platform providers will use ${variable_name} to define their
> >> own
> >> > > configurations and allow them to be embedded into SQL scripts. Will
> >> there
> >> > > be any conflict with option 3?
> >> > >
> >> > > Best regards,
> >> > > Jing

Re: [DISCUSS] FLIP-330: Support specifying record timestamp requirement

2023-07-31 Thread Jark Wu
Hi Yunfeng,

I think this is a great idea to improve the serialization performance,
especially for batch jobs.
I'm not sure whether you have considered or tested this optimization for
batch jobs.
IMO, this optimization can be enabled by default for batch jobs, because
they don't have watermarks
and don't need latency markers (batch jobs care about throughput rather than
latency).
I'm also very much looking forward to the benchmark result of TPC-DS (batch
mode)!

On the other hand, I'm also very curious about the performance improvement.
According to your analysis,
the performance improvement mainly comes from the 1-byte serialization
reduction.
The POC shows a 20% improvement, which is amazing. However, I noticed this
POC is
not representative enough, because the record type is the simplest "boolean"
type, which
means half of the serialized bytes can be eliminated. However, a real row of
data won't
be a simple
boolean type, but will carry different types of 100 B or 1 KB. That means the
20% is a maximum
theoretical improvement. I'd rather see some benchmark results of actual
workloads,
for example, TPC-DS, Nexmark[1], or even a WordCount job[2]. Could you help
to verify those
workloads?
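
(A rough reading of those numbers: a serialized boolean record is about
2 bytes, a 1-byte tag plus a 1-byte value, so dropping the tag removes roughly
half the bytes, and the measured 20% throughput gain is close to the best
case. For a 100 B value the tag is 1 byte out of about 101, under 1%, so
realistic rows should see a much smaller gain, hence the request for
TPC-DS/Nexmark numbers.)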

Best,
Jark

[1]: https://github.com/nexmark/nexmark
[2]:
https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/wordcount/WordCount.java



On Mon, 17 Jul 2023 at 23:59, Jing Ge  wrote:

> Hi Yunfeng,
>
> Thanks for your clarification. It might make sense to add one more
> performance test with object-reuse disabled alongside to let us know how
> big the improvement will be.
>
> Best regards,
> Jing
>
> On Thu, Jul 13, 2023 at 11:51 AM Matt Wang  wrote:
>
> > Hi Yunfeng,
> >
> > Thanks for the proposal. The POC showed a performance improvement of 20%,
> > which is very exciting. But I have some questions:
> > 1. Is the performance improvement here mainly due to the reduction of
> > serialization, or is it due to the judgment consumption caused by tags?
> > 2. Watermark is not needed in some scenarios, but the latency maker is a
> > useful function. If the latency maker cannot be used, it will greatly
> limit
> > the usage scenarios. Whether the solution design can retain the
> capability
> > of the latency marker;
> > 3. The data of the POC test is of long type. Here I want to see how much
> > profit it will have if it is a string with a length of 100B or 1KB.
> >
> >
> > --
> >
> > Best,
> > Matt Wang
> >
> >
> >  Replied Message 
> > | From | Yunfeng Zhou |
> > | Date | 07/13/2023 14:52 |
> > | To |  |
> > | Subject | Re: [DISCUSS] FLIP-330: Support specifying record timestamp
> > requirement |
> > Hi Jing,
> >
> > Thanks for reviewing this FLIP.
> >
> > 1. I did change the names of some APIs in the FLIP compared with the
> > original version on which I implemented the POC. As the core
> > optimization logic remains the same and the POC's performance can
> > still reflect the current FLIP's expected improvement, I have not
> > updated the POC code after that. I'll add a note in the benchmark
> > section of the FLIP saying that the namings in the POC code might be
> > outdated, and the FLIP is still the source of truth for our proposed
> > design.
> >
> > 2. This FLIP could bring a fixed reduction in the workload of the
> > per-record serialization path in Flink, so the lower the absolute time
> > cost of the non-optimized components, the more obvious the performance
> > improvement of this FLIP would be. That's why I chose to
> > enable object reuse and to transmit Boolean values in serialization.
> > If it would be more widely regarded as acceptable for a benchmark to
> > adopt more commonly-applied behavior (for object reuse, I believe
> > disabled is more common), I would be glad to update the benchmark
> > result to disable object reuse.
> >
> > Best regards,
> > Yunfeng
> >
> >
> > On Thu, Jul 13, 2023 at 6:37 AM Jing Ge 
> > wrote:
> >
> > Hi Yunfeng,
> >
> > Thanks for the proposal. It makes sense to offer the optimization. I got
> > some NIT questions.
> >
> > 1. I guess you changed your thoughts while coding the POC: I found
> > pipeline.enable-operator-timestamp in the code, but
> > pipeline.force-timestamp-support is defined in the FLIP.
> > 2. About the benchmark example, why did you enable object reuse? Since it
> > is an optimization of serde, will the benchmark be better if it is
> > disabled?
> >
> > Best regards,
> > Jing
> >
> > On Mon, Jul 10, 2023 at 11:54 AM Yunfeng Zhou <
> flink.zhouyunf...@gmail.com
> > >
> > wrote:
> >
> > Hi all,
> >
> > Dong(cc'ed) and I are opening this thread to discuss our proposal to
> > support optimizing StreamRecord's serialization performance.
> >
> > Currently, a StreamRecord would be converted into a 1-byte tag (+
> > 8-byte timestamp) + N-byte serialized value during the serialization
> > process. In scenarios where timestamps and watermarks are not needed,
> > and latency tracking is enabled, this process would 

Re: [DISCUSS] Add missing visibility annotation for Table APIs

2023-07-31 Thread Jark Wu
he
> > public
> > > > > interfaces of the FLIP [2], I'm fine with making them all public
> ones
> > or
> > > > > just excluding the Trigger implementors, cc @Qingsheng can you also
> > > help
> > > > to
> > > > > check this?
> > > > >
> > > >
> > > > I'm fine with both. Let's wait for Qingsheng's opinion.
> > > >
> > > >
> > > > For the `BuiltInFunctionDefinitions$Builder`, I think it should be
> > > > > `BuiltInFunctionDefinition$Builder`
> > > >
> > > >
> > > > Yes, this is a typo, and I've corrected it.
> > > >
> > > >
> > > > To Jing
> > > >
> > > > do we really need to
> > > > > mark so many classes as @Internal? What exactly is the difference
> > > > between
> > > > > a public class with no annotation and one with @Internal?
> > > >
> > > >
> > > > IMO it is still necessary. From a user's perspective, marking a class
> > as
> > > > @Internal has a clear directionality, indicating that this is an
> > internal
> > > > class, and I should not rely on it. However, facing an unmarked
> class,
> > I
> > > > wonder whether it is safe to depend on it in my code. From a
> > developer's
> > > > perspective, marking a class as @Internal also helps us to be more
> > > > confident when iterating and updating interfaces. We can be sure that
> > > this
> > > > proposed change will not have unexpected behavior (because we have
> > > informed
> > > > users that it is internal and cannot promise the same compatibility
> > > > guarantee as public APIs).
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning#:~:text=PublicEvolving%0Apublic%20class-,DynamicFilteringEvent,-implements%20SourceEvent%20%7B
> > > >
> > > > Best,
> > > > Jane
> > > >
> > > > On Wed, Jul 26, 2023 at 1:26 AM Jing Ge 
> > > > wrote:
> > > >
> > > > > Hi Jane,
> > > > >
> > > > > Thanks for your effort of walking through all classes and compiling
> > the
> > > > > sheet. It is quite helpful. Just out of curiosity, do we really
> need
> > to
> > > > > mark so many classes as @Internal? What exactly is the difference
> > > > between
> > > > > a public class with no annotation and one with @Internal?
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > >
> > > > > On Tue, Jul 25, 2023 at 11:06 AM Lincoln Lee <
> lincoln.8...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jane,
> > > > > >
> > > > > > Thanks for driving this! Overall the proposed annotations look
> > good
> > > to
> > > > > me.
> > > > > > Some comments for the table[1]:
> > > > > >
> > > > > > For the `DynamicFilteringEvent`, I tend to keep it 'internal' since
> > > it's
> > > > a
> > > > > > concrete implementation of `SourceEvent` (and the other two
> > implementers
> > > > are
> > > > > > not public ones).
> > > > > >
> > > > > > For the `LookupOptions` and `Trigger`s, because they're all in
> the
> > > > public
> > > > > > interfaces of the FLIP [2], I'm fine with making them all public
> > ones
> > > or
> > > > > > just excluding the Trigger implementors, cc @Qingsheng can you
> also
> > > > help
> > > > > to
> > > > > > check this?
> > > > > >
> > > > > > For the `BuiltInFunctionDefinitions$Builder`, I think it should
> be
> > > > > > `BuiltInFunctionDefinition$Builder`.
> > > > > >
> > > > > >
> > > > > > [1].
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/spreadsheets/d/1e8M0tUtKkZXEd8rCZtZ0C6Ty9QkNaPySsrCgz0vEID4/edit?usp=sharing
> > > > > > [

Re: [VOTE] Release 2.0 must-have work items - Round 2

2023-07-26 Thread Jark Wu
+1 (binding)

Thanks Xintong for driving this. Thanks all for finalizing the
SourceFunction conclusion.

Best,
Jark

On Wed, 26 Jul 2023 at 22:28, Alexander Fedulov 
wrote:

> +1 (non-binding), assuming SourceFunction gets added back to the
> doc as a "nice-to-have". I am glad we've reached a consensus here.
> Extra thanks to Leonard for coordinating this discussion in particular.
>
> Best,
> Alex
>
> On Wed, 26 Jul 2023 at 15:43, Jing Ge  wrote:
>
> > +1 (non-binding), glad to see we are now on the same page. Thank you all.
> >
> > Best regards,
> > Jing
> >
> > On Wed, Jul 26, 2023 at 5:18 PM Yun Tang  wrote:
> >
> > > +1 (non-binding), thanks @xintong for driving this work.
> > >
> > >
> > > Best
> > > Yun Tang
> > > 
> > > From: Zhu Zhu 
> > > Sent: Wednesday, July 26, 2023 16:35
> > > To: dev@flink.apache.org 
> > > Subject: Re: [VOTE] Release 2.0 must-have work items - Round 2
> > >
> > > +1 (binding)
> > >
> > > Thanks,
> > > Zhu
> > >
> > > Leonard Xu  于2023年7月26日周三 15:40写道:
> > > >
> > > > Thanks @xingtong for driving the work.
> > > >
> > > > +1(binding)
> > > >
> > > > Best,
> > > > Leonard
> > > >
> > > > > On Jul 26, 2023, at 3:18 PM, Konstantin Knauf <
> > > knauf.konstan...@gmail.com> wrote:
> > > > >
> > > > > Hi Xingtong,
> > > > >
> > > > > yes, I am fine with the conclusion for SourceFunction. I chatted
> with
> > > > > Leonard a bit last night. Let's continue this vote.
> > > > >
> > > > > Thanks for the clarification,
> > > > >
> > > > > Konstantin
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jul 26, 2023 at 04:03, Xintong Song <
> > > > > tonysong...@gmail.com> wrote:
> > > > >
> > > > >> Hi Konstantin,
> > > > >>
> > > > >> It seems the offline discussion has already taken place [1], and
> > part
> > > of
> > > > >> the outcome is that removal of SourceFunction would be a
> > > *nice-to-have*
> > > > >> item for release 2.0 which may not block this *must-have* vote. Do
> > > you have
> > > > >> different opinions about the conclusions in [1]?
> > > > >>
> > > > >> If there are still concerns, and the discussion around this topic
> > > needs to
> > > > >> be continued, then I'd suggest (as I mentioned in [2]) not to
> > further
> > > block
> > > > >> this vote (i.e. the decision on other must-have items). Release
> 2.0
> > > still
> > > > >> has a long way to go, and it is likely we need to review and
> update
> > > the
> > > > >> list every once in a while. We can update the list with another
> vote
> > > if
> > > > >> later we decide to add the removal of SourceFunction to the
> > must-have
> > > list.
> > > > >>
> > > > >> WDYT?
> > > > >>
> > > > >> Best,
> > > > >>
> > > > >> Xintong
> > > > >>
> > > > >>
> > > > >> [1]
> > https://lists.apache.org/thread/yyw52k45x2sp1jszldtdx7hc98n72w7k
> > > > >> [2]
> > https://lists.apache.org/thread/j5d5022ky8k5t088ffm03727o5g9x9jr
> > > > >>
> > > > >> On Tue, Jul 25, 2023 at 8:49 PM Konstantin Knauf <
> kna...@apache.org
> > >
> > > > >> wrote:
> > > > >>
> > > > >>> I assume this vote includes a decision not to remove
> > > > >>> SourceFunction/SinkFunction in Flink 2.0 (as it has been removed
> > > > >>> from the table). If this is the case, I don't think this discussion has
> > > > >>> table). If this is the case, I don't think, this discussion has
> > > > >> concluded.
> > > > >>> There are multiple contributors like myself, Martijn, Alex
> Fedulov
> > > and
> > > > >>> Maximilian Michels, who have indicated they would be in favor of
> > > > >>> deprecating/dropping them. This Source/Sink Function discussion
> > > seems to
> > > > >> go
> > > > >>> in circles in general. I am wondering if it makes sense to have a
> > > call
> > > > >>> about this instead of repeating mailing list discussions.
> > > > >>>
> > > > >>> On Tue, Jul 25, 2023 at 13:38, Yu Li <
> car...@gmail.com
> > > wrote:
> > > > >>>
> > > >  +1 (binding)
> > > > 
> > > >  Thanks for driving this, Xintong!
> > > > 
> > > >  Best Regards,
> > > >  Yu
> > > > 
> > > > 
> > > >  On Sun, 23 Jul 2023 at 18:28, Yuan Mei 
> > > wrote:
> > > > 
> > > > > +1 (binding)
> > > > >
> > > > > Thanks for driving the discussion through and for all the
> efforts
> > > in
> > > > > resolving the complexities :-)
> > > > >
> > > > > Best
> > > > > Yuan
> > > > >
> > > > > On Thu, Jul 20, 2023 at 5:23 PM Xintong Song <
> > > tonysong...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> I'd like to start another round of VOTE for the must-have work
> > > > >> items
> > > >  for
> > > > >> release 2.0 [1]. The corresponding discussion thread is [2],
> and
> > > > >> the
> > > > >> previous voting thread is [3]. All comments from the previous
> > > > >> voting
> > > > > thread
> > > > >> have been addressed.
> > > > >>
> > > > >> Please note that once the vote is approved, any changes to the
> > > >  must-have
> > > > >> items (adding / removing must-have 

Re: [DISCUSS] Add missing visibility annotation for Table APIs

2023-07-24 Thread Jark Wu
Hi Jane,

Thanks for kicking off this work and collecting the detailed list.

+1 to add the missing annotation.

When looking at classes in table-common and table-api, it often confuses me
whether a class can be modified without breaking compatibility. Explicitly
marking the visibility would be helpful in this case.

I have some additional suggestions wrt the class annotations:
- classes in org.apache.flink.table.catalog.stats.* can be @PublicEvolving,
because all the classes in there are needed to build the @PublicEvolving
 CatalogColumnStatistics.
- PeriodicCacheReloadTrigger and TimedCacheReloadTrigger can be
@PublicEvolving,
they are built-in implementations of cache reload trigger and are exposed
to connectors.
- CoreModule can be @PublicEvolving to allow users to use it in
TableEnv#loadModule(name, Module).
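
As a concrete illustration of the suggested markings, here is a minimal
sketch using Flink's real annotation types (org.apache.flink.annotation.*);
the example classes themselves are hypothetical:

import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.PublicEvolving;

// Users may depend on this; changes must follow API-compatibility rules.
@PublicEvolving
public interface ExampleReloadTrigger {
    void open();
}

// Explicitly excluded from the public API surface; free to change any time.
@Internal
final class ExampleReloadTriggerUtils {
    private ExampleReloadTriggerUtils() {}
}

Such a convention could presumably also be enforced automatically, e.g. via
the ArchUnit rules Flink already runs, so that a newly added public class
without any annotation fails the build.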

Best,
Jark


On Fri, 21 Jul 2023 at 00:34, Jane Chan  wrote:

> Hi, Devs,
>
> I would like to start a discussion on adding missing visibility annotation
> (PublicEvolving / Internal etc.) for Table APIs.
>
> The motivation for starting this discussion: during the cleanup of which
> Table APIs to deprecate for version 2.0, I noticed that some of the APIs
> lack visibility annotations, which may lead to users relying on APIs that
> should have been marked as internal.
>
> Therefore, I have compiled a sheet[1] listing the currently unmarked
> classes under the table-api-java, table-api-java-bridge, and table-common
> modules and the recommended annotations to be added.
>
> My thought is that all public classes/interfaces within the three modules
> mentioned above should be explicitly marked, and we might consider
> introducing a new architectural rule to perform auto-check on newly added
> APIs in the future.
>
> Let me explain the details.
>
> 1. Why table-api-java, table-api-java-bridge, and table-common?
> Because according to Flink's application project configuration doc[2],
> table-api-java and table-api-java-bridge are the leading dependencies for
> users to develop a table program. Although flink-table-common is not
> listed, it is the core dependency for users to implement a User-Defined
> Function/Connector[3].
>
> 2. How are the classes listed in this table compiled?
> I used a customized IntelliJ profile to perform a code inspection under
> these three modules to find all public classes/interfaces without API
> visibility annotations, along with a manual check.
>
> 3. How is the suggested API visibility to be determined?
> For all APIs suggested as PublicEvolving, I added a comment on the cell to
> explain the reason. The remaining APIs, which are indicated as Internal, are
> either util classes or implementations.
>
> I'm looking forward to your ideas, and it would be great if any interested
> developers could help review this list together.
>
>
> [1]
>
> https://docs.google.com/spreadsheets/d/1e8M0tUtKkZXEd8rCZtZ0C6Ty9QkNaPySsrCgz0vEID4/edit?usp=sharing
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/configuration/overview/#which-dependencies-do-you-need
> [3]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sourcessinks/#project-configuration
>
>
> Best regards,
> Jane
>


Re: [ANNOUNCE] New Apache Flink Committer - Yong Fang

2023-07-24 Thread Jark Wu
Congratulations, Yong Fang!

Best,
Jark

On Mon, 24 Jul 2023 at 22:11, Wencong Liu  wrote:

> Congratulations!
>
> Best,
> Wencong Liu
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On 2023-07-24 at 11:03:30, "Paul Lam" wrote:
> >Congrats, Shammon!
> >
> >Best,
> >Paul Lam
> >
>> On Jul 24, 2023, at 10:56, Jingsong Li wrote:
> >>
> >> Shammon
> >
>


Re: [VOTE] FLIP-309: Support using larger checkpointing interval when source is processing backlog

2023-07-18 Thread Jark Wu
+1 (binding)

Best,
Jark

On Tue, 18 Jul 2023 at 20:30, Piotr Nowojski  wrote:

> +1 (binding)
>
> Piotrek
>
> wt., 18 lip 2023 o 08:51 Jing Ge  napisał(a):
>
> > +1(binding)
> >
> > Best regards,
> > Jing
> >
> > On Tue, Jul 18, 2023 at 8:31 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > +1(binding)
> > >
> > > Best,
> > > Rui Fan
> > >
> > >
> > > On Tue, Jul 18, 2023 at 12:04 PM Dong Lin  wrote:
> > >
> > > > Hi all,
> > > >
> > > > We would like to start the vote for FLIP-309: Support using larger
> > > > checkpointing interval when source is processing backlog [1]. This
> FLIP
> > > was
> > > > discussed in this thread [2].
> > > >
> > > > The vote will be open until at least July 21st (at least 72 hours),
> > > > following
> > > > the consensus voting process.
> > > >
> > > > Cheers,
> > > > Yunfeng and Dong
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog
> > > > [2] https://lists.apache.org/thread/l1l7f30h7zldjp6ow97y70dcthx7tl37
> > > >
> > >
> >
>


Re: [DISCUSS] Update Flink Roadmap

2023-07-16 Thread Jark Wu
Hi Jiabao,

Thank you for your suggestions. I have added them to the "Going Beyond a
SQL Stream/Batch Processing Engine" and "Large-Scale State Jobs" sections.

Best,
Jark

On Thu, 13 Jul 2023 at 16:06, Jiabao Sun 
wrote:

> Thanks Jark and Martijn for driving this.
>
> There are two suggestions about the Table API:
>
> - Add the JSON type to adapt to NoSQL database types.
> - Remove changelog normalize operator for upsert stream.
>
>
> Best,
> Jiabao
>
>
> > On Jul 13, 2023, at 3:49 PM, Jark Wu wrote:
> >
> > Hi all,
> >
> > Sorry for taking so long to get back to this thread.
> >
> > Martijn and I have drafted the first version of the updated roadmap,
> > including the updated feature radar reflecting the current state of
> > different components.
> >
> https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit
> >
> > Feel free to leave comments in the thread or the document.
> > We may have missed mentioning something important, so your help in
> > enriching the content is greatly appreciated.
> >
> > Best,
> > Jark & Martijn
> >
> >
> > On Fri, 2 Jun 2023 at 00:50, Jing Ge  wrote:
> >
> >> Hi Jark,
> >>
> >> Fair enough. Let's do it like you suggested. Thanks!
> >>
> >> Best regards,
> >> Jing
> >>
> >> On Thu, Jun 1, 2023 at 6:00 PM Jark Wu  wrote:
> >>
> >>> Hi Jing,
> >>>
> >>> This thread is for discussing the roadmap for versions 1.18, 2.0, and
> >> even
> >>> more.
> >>> One of the outcomes of this discussion will be an updated version of
> the
> >>> current roadmap.
> >>> Let's work together on refining the roadmap in this thread.
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>> On Thu, 1 Jun 2023 at 23:25, Jing Ge 
> wrote:
> >>>
> >>>> Hi Jark,
> >>>>
> >>>> Thanks for driving it! For point 2, since we are developing 1.18 now,
> >>>> does it make sense to update the roadmap this time while we are
> >> releasing
> >>>> 1.18? This discussion thread will be focusing on the Flink 2.0
> roadmap,
> >>> as
> >>>> you mentioned previously. WDYT?
> >>>>
> >>>> Best regards,
> >>>> Jing
> >>>>
> >>>> On Thu, Jun 1, 2023 at 3:31 PM Jark Wu  wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Martijn and I would like to initiate a discussion on the Flink
> >> roadmap,
> >>>>> which should cover the project's long-term roadmap and the regular
> >>> update
> >>>>> mechanism.
> >>>>>
> >>>>> Xintong has already started a discussion about Flink 2.0 planning.
> >> One
> >>> of
> >>>>> the points raised in that discussion is that we should have a
> >>> high-level
> >>>>> discussion of the roadmap to present where the project is heading
> >>> (which
> >>>>> doesn't necessarily need to block the Flink 2.0 planning). Moreover,
> >>> the
> >>>>> roadmap on the Flink website [1] hasn't been updated for half a year,
> >>> and
> >>>>> the last update was for the feature radar for the 1.15 release. It
> >> has
> >>>> been
> >>>>> 2 years since the community discussed Flink's overall roadmap.
> >>>>>
> >>>>> I would like to raise two topics for discussion:
> >>>>>
> >>>>> 1. The new roadmap. This should be an updated version of the current
> >>>>> roadmap[1].
> >>>>> 2. A mechanism to regularly discuss and update the roadmap.
> >>>>>
> >>>>> To make the first topic discussion more efficient, Martijn and I
> >>>> volunteer
> >>>>> to summarize the ongoing big things of different components and
> >>> present a
> >>>>> roadmap draft to the community in the next few weeks. This should be
> >> a
> >>>> good
> >>>>> starting point for a more detailed discussion.
> >>>>>
> >>>>> Regarding the regular update mechanism, there was a proposal in a
> >>> thread
> >>>>> [2] three years ago to make the release manager responsible for
> >>> updating
> >>>>> the roadmap. However, it appears that this was not documented as a
> >>>> release
> >>>>> management task [3], and the roadmap update wasn't performed for
> >>> releases
> >>>>> 1.16 and 1.17.
> >>>>>
> >>>>> In my opinion, making release managers responsible for keeping the
> >>>> roadmap
> >>>>> up to date is a good idea. Specifically, release managers of release
> >> X
> >>>> can
> >>>>> kick off the roadmap update at the beginning of release X, which can
> >>> be a
> >>>>> joint task with collecting a feature list [4]. Additionally, release
> >>>>> managers of release X-1 can help verify and remove the accomplished
> >>> items
> >>>>> from the roadmap and update the feature radar.
> >>>>>
> >>>>> What do you think? Do you have other ideas?
> >>>>>
> >>>>> Best,
> >>>>> Jark & Martijn
> >>>>>
> >>>>> [1]: https://flink.apache.org/roadmap.html
> >>>>> [2]:
> >> https://lists.apache.org/thread/o0l3cg6yphxwrww0k7215jgtw3yfoybv
> >>>>> [3]:
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Release+Management
> >>>>> [4]: https://cwiki.apache.org/confluence/display/FLINK/1.18+Release
> >>>>>
> >>>>
> >>>
> >>
>
>


Re: [DISCUSS] Update Flink Roadmap

2023-07-13 Thread Jark Wu
Hi all,

Sorry for taking so long to get back to this thread.

Martijn and I have drafted the first version of the updated roadmap,
including the updated feature radar reflecting the current state of
different components.
https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit

Feel free to leave comments in the thread or the document.
We may have missed mentioning something important, so your help in
enriching the content is greatly appreciated.

Best,
Jark & Martijn


On Fri, 2 Jun 2023 at 00:50, Jing Ge  wrote:

> Hi Jark,
>
> Fair enough. Let's do it like you suggested. Thanks!
>
> Best regards,
> Jing
>
> On Thu, Jun 1, 2023 at 6:00 PM Jark Wu  wrote:
>
> > Hi Jing,
> >
> > This thread is for discussing the roadmap for versions 1.18, 2.0, and
> even
> > more.
> > One of the outcomes of this discussion will be an updated version of the
> > current roadmap.
> > Let's work together on refining the roadmap in this thread.
> >
> > Best,
> > Jark
> >
> > On Thu, 1 Jun 2023 at 23:25, Jing Ge  wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for driving it! For point 2, since we are developing 1.18 now,
> > > does it make sense to update the roadmap this time while we are
> releasing
> > > 1.18? This discussion thread will be focusing on the Flink 2.0 roadmap,
> > as
> > > you mentioned previously. WDYT?
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Thu, Jun 1, 2023 at 3:31 PM Jark Wu  wrote:
> > >
> > > > Hi all,
> > > >
> > > > Martijn and I would like to initiate a discussion on the Flink
> roadmap,
> > > > which should cover the project's long-term roadmap and the regular
> > update
> > > > mechanism.
> > > >
> > > > Xintong has already started a discussion about Flink 2.0 planning.
> One
> > of
> > > > the points raised in that discussion is that we should have a
> > high-level
> > > > discussion of the roadmap to present where the project is heading
> > (which
> > > > doesn't necessarily need to block the Flink 2.0 planning). Moreover,
> > the
> > > > roadmap on the Flink website [1] hasn't been updated for half a year,
> > and
> > > > the last update was for the feature radar for the 1.15 release. It
> has
> > > been
> > > > 2 years since the community discussed Flink's overall roadmap.
> > > >
> > > > I would like to raise two topics for discussion:
> > > >
> > > > 1. The new roadmap. This should be an updated version of the current
> > > > roadmap[1].
> > > > 2. A mechanism to regularly discuss and update the roadmap.
> > > >
> > > > To make the first topic discussion more efficient, Martijn and I
> > > volunteer
> > > > to summarize the ongoing big things of different components and
> > present a
> > > > roadmap draft to the community in the next few weeks. This should be
> a
> > > good
> > > > starting point for a more detailed discussion.
> > > >
> > > > Regarding the regular update mechanism, there was a proposal in a
> > thread
> > > > [2] three years ago to make the release manager responsible for
> > updating
> > > > the roadmap. However, it appears that this was not documented as a
> > > release
> > > > management task [3], and the roadmap update wasn't performed for
> > releases
> > > > 1.16 and 1.17.
> > > >
> > > > In my opinion, making release managers responsible for keeping the
> > > roadmap
> > > > up to date is a good idea. Specifically, release managers of release
> X
> > > can
> > > > kick off the roadmap update at the beginning of release X, which can
> > be a
> > > > joint task with collecting a feature list [4]. Additionally, release
> > > > managers of release X-1 can help verify and remove the accomplished
> > items
> > > > from the roadmap and update the feature radar.
> > > >
> > > > What do you think? Do you have other ideas?
> > > >
> > > > Best,
> > > > Jark & Martijn
> > > >
> > > > [1]: https://flink.apache.org/roadmap.html
> > > > [2]:
> https://lists.apache.org/thread/o0l3cg6yphxwrww0k7215jgtw3yfoybv
> > > > [3]:
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Release+Management
> > > > [4]: https://cwiki.apache.org/confluence/display/FLINK/1.18+Release
> > > >
> > >
> >
>


Re: [VOTE] Release 2.0 must-have work items

2023-07-09 Thread Jark Wu
+1  (binding)

Thanks for driving this. Looking forward to starting the 2.0 work.

Best,
Jark

On Fri, 7 Jul 2023 at 17:31, Xintong Song  wrote:

> Hi all,
>
> I'd like to start the VOTE for the must-have work items for release 2.0
> [1]. The corresponding discussion thread is [2].
>
> Please note that once the vote is approved, any changes to the must-have
> items (adding / removing must-have items, changing the priority) requires
> another vote. Assigning contributors / reviewers, updating descriptions /
> progress, changes to nice-to-have items do not require another vote.
>
> The vote will be open until at least July 12, following the consensus
> voting process. Votes of PMC members are binding.
>
> Best,
>
> Xintong
>
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/2.0+Release
>
> [2] https://lists.apache.org/thread/l3dkdypyrovd3txzodn07lgdwtwvhgk4
>


Re: [ANNOUNCE] Apache Flink has won the 2023 SIGMOD Systems Award

2023-07-03 Thread Jark Wu
Congrats everyone!

Best,
Jark

> On 2023-07-03 22:37, Yuval Itzchakov  wrote:
> 
> Congrats team!
> 
> On Mon, Jul 3, 2023, 17:28 Jing Ge via user  wrote:
>> Congratulations!
>> 
>> Best regards,
>> Jing
>> 
>> 
>> On Mon, Jul 3, 2023 at 3:21 PM yuxia  wrote:
>>> Congratulations!
>>> 
>>> Best regards,
>>> Yuxia
>>> 
>>> From: "Pushpa Ramakrishnan"
>>> To: "Xintong Song" <tonysong...@gmail.com>
>>> Cc: "dev" <dev@flink.apache.org>, "User" <u...@flink.apache.org>
>>> Sent: Monday, July 3, 2023, 8:36:30 PM
>>> Subject: Re: [ANNOUNCE] Apache Flink has won the 2023 SIGMOD Systems Award
>>> 
>>> Congratulations 🥳
>>> 
>>> On 03-Jul-2023, at 3:30 PM, Xintong Song  wrote:
>>> 
>>> 
>>> Dear Community,
>>> 
>>> I'm pleased to share this good news with everyone. As some of you may have 
>>> already heard, Apache Flink has won the 2023 SIGMOD Systems Award [1].
>>> 
>>> "Apache Flink greatly expanded the use of stream data-processing." -- 
>>> SIGMOD Awards Committee
>>> 
>>> SIGMOD is one of the most influential data management research conferences 
>>> in the world. The Systems Award is awarded to an individual or set of 
>>> individuals to recognize the development of a software or hardware system 
>>> whose technical contributions have had significant impact on the theory or 
>>> practice of large-scale data management systems. Winning of the award 
>>> indicates the high recognition of Flink's technological advancement and 
>>> industry influence from academia.
>>> 
>>> As an open-source project, Flink wouldn't have come this far without the 
>>> wide, active and supportive community behind it. Kudos to all of us who 
>>> helped make this happen, including the over 1,400 contributors and many 
>>> others who contributed in ways beyond code.
>>> 
>>> Best,
>>> Xintong (on behalf of the Flink PMC)
>>> 
>>> [1] https://sigmod.org/2023-sigmod-systems-award/
>>> 



Re: [DISCUSS] Release 2.0 Work Items

2023-06-29 Thread Jark Wu
Hi,

I think one more thing we need to consider to do in 2.0 is changing the
default value of configuration to improve out-of-box user experience.

Currently, in order to run a Flink job, users may need to set
a bunch of configurations, such as minibatch, checkpoint interval,
exactly-once,
incremental-checkpoint, etc. It's very verbose and hard to use for
beginners.
Most of them can have a universally applicable value. Because changing a
default value is a breaking change, I think it's worth considering changing
them in 2.0.

What do you think?

Best,
Jark
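To make the pain point concrete, this is the kind of configuration a new
user typically has to discover and set today; the keys are existing Flink
options, the values are illustrative only:

    # flink-conf.yaml (illustrative values)
    execution.checkpointing.interval: 3min
    execution.checkpointing.mode: EXACTLY_ONCE
    state.backend.incremental: true
    # SQL mini-batch tuning
    table.exec.mini-batch.enabled: true
    table.exec.mini-batch.allow-latency: 1s
    table.exec.mini-batch.size: 1000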


On Wed, 28 Jun 2023 at 14:10, Sergey Nuyanzin  wrote:

> Hi Chesnay
>
> >"Move Calcite rules from Scala to Java": I would hope that this would be
> >an entirely internal change, and could thus be an incremental process
> >independent of major releases.
> >What is the actual scale of this item; how much are we actually
> re-writing?
>
> Thanks for asking
> yes, you're right, that should be internal change.
> Yeah I was also thinking about incremental change (rule by rule or
> reasonable small group of rules).
> And yes, this could be an independent (on major release) activity
>
> The problem actually concerns the subclasses of RelOptRule.
> Currently I see 60+ such rules (in Scala) using the mentioned deprecated
> API.
> There are also children of ConverterRule (50+) which do not have such
> issues.
> Maybe it could be considered as the next step to have all the rules in
> Java.
>
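For context, migrating off the deprecated RelOptRule constructors means
adopting Calcite's config-based RelRule pattern. A minimal sketch with a
made-up rule name (the Config plumbing differs across Calcite versions;
newer releases generate immutable Config implementations instead of the
hand-written variant shown here):

    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.plan.RelRule;
    import org.apache.calcite.rel.logical.LogicalFilter;

    public class MyFilterRule extends RelRule<MyFilterRule.Config> {

        protected MyFilterRule(Config config) {
            super(config);
        }

        @Override
        public void onMatch(RelOptRuleCall call) {
            LogicalFilter filter = call.rel(0);
            // The actual rewrite logic would go here.
        }

        public interface Config extends RelRule.Config {
            Config DEFAULT = EMPTY
                    .withOperandSupplier(b ->
                            b.operand(LogicalFilter.class).anyInputs())
                    .as(Config.class);

            @Override
            default MyFilterRule toRule() {
                return new MyFilterRule(this);
            }
        }
    }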
> On Tue, Jun 27, 2023 at 1:34 PM Xintong Song 
> wrote:
>
> > Hi Alex & Gyula,
> >
> > By compatibility discussion do you mean the "[DISCUSS] FLIP-321:
> Introduce
> > > an API deprecation process" thread [1]?
> > >
> >
> > Yes, I meant the FLIP-321 discussion. I just noticed I pasted the wrong
> url
> > in my previous email. Sorry for the mistake.
> >
> > I am also curious to know if the rationale behind this new API has been
> > > previously discussed on the mailing list. Do we have a list of
> > shortcomings
> > > in the current DataStream API that it tries to resolve? How does the
> > > current ProcessFunction functionality fit into the picture? Will it be
> > kept
> > > as is or subsumed by new API?
> > >
> >
> > I don't think we should create a replacement for the DataStream API
> unless
> > > we have a very good reason to do so and with a proper discussion about
> > this
> > > as Alex said.
> >
> >
> > The ProcessFunction API, which is intended to replace the DataStream API,
> > is still a proposal, not a decision. Sorry for the confusion, I should have
> > been more careful with my words, not giving the impression that this is
> > something we'll do anyway.
> >
> > There will be a FLIP describing the motivations and designs in detail,
> for
> > the community to discuss and vote on. We are still working on it. TBH,
> this
> > is not trivial and we would need more time on it.
> >
> > Just to quickly share some backgrounds:
> >
> >- We see quite some problems with the current DataStream APIs
> >   - Users are working with concrete classes rather than interfaces,
> >   which means
> >  - Users can access methods that are designed to be used by internal
> >  classes, even though they are annotated with `@Internal`. E.g.,
> >  `DataStream#getTransformation`.
> >  - Changes to the non-API implementations (e.g., `Transformation`)
> >  would affect the API classes (e.g., `DataStream`), which makes it
> >  hard to provide binary compatibility.
> >   - Internal classes are used as parameters / return values of public
> >   APIs. E.g., while `AbstractStreamOperator` is PublicEvolving,
> >   `StreamTask`, which is returned from
> >   `AbstractStreamOperator#getContainingTask`, is Internal.
> >   - In many cases, users are asked to extend the API classes, rather
> >   than implementing interfaces. E.g., `AbstractStreamOperator`.
> >  - Any changes to the base classes, even the internal parts, may
> >  affect the behavior of the user-provided sub-classes.
> >  - Users can override the behavior of the base classes.
> >   - The API module `flink-streaming-java` contains non-API classes, and
> >   depends on internal modules such as `flink-runtime`, which means
> >   - Changes to the internal modules may affect the API modules, which
> >  requires users to re-build their applications upon upgrading.
> >  - The artifact users need for building their applications is larger
> >  than necessary.
> >   - We probably should not expose operators (e.g.,
> >   `AbstractStreamOperator`) to users. Functions should be enough for
> >   users to define their data processing logic. Exposing operator-level
> >   concepts (e.g., mailbox thread model, checkpoint barrier alignment,
> >   etc.) is unnecessary and limits improvements to such exposed
> >   mechanisms due to compatibility considerations.
> >   - The current DataStream API seem
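The first bullet above is easy to reproduce: the following compiles today,
even though Transformation is annotated @Internal and is not meant to be
user-facing (a sketch; any pipeline would do):

    import org.apache.flink.api.dag.Transformation;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class InternalLeakExample {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> stream = env.fromElements("a", "b", "c");
            // Reachable from user code because DataStream is a concrete class
            // rather than an interface, despite the @Internal annotation.
            Transformation<String> transformation = stream.getTransformation();
            System.out.println(transformation.getName());
        }
    }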

Re: [VOTE] FLIP-309: Support using larger checkpointing interval when source is processing backlog

2023-06-29 Thread Jark Wu
+1 (binding)

Best,
Jark

> On 2023-06-29 18:12, Jing Ge  wrote:
> 
> +1(binding)
> 
> On Thu, Jun 29, 2023 at 7:47 AM Leonard Xu  wrote:
> 
>> +1 (binding)
>> 
>> Best,
>> Leonard
>> 
>>> On Jun 29, 2023, at 1:25 PM, Jingsong Li  wrote:
>>> 
>>> +1 binding
>>> 
>>> On Thu, Jun 29, 2023 at 11:03 AM Dong Lin  wrote:
 
>>>> Hi all,
>>>>
>>>> We would like to start the vote for FLIP-309: Support using larger
>>>> checkpointing interval when source is processing backlog [1]. This FLIP
>>>> was discussed in this thread [2].
>>>>
>>>> Flink 1.18 release will feature freeze on July 11. We hope to make this
>>>> feature available in Flink 1.18.
>>>>
>>>> The vote will be open until at least July 4th (at least 72 hours),
>>>> following the consensus voting process.
>>>>
>>>> Cheers,
>>>> Yunfeng and Dong
>>>>
>>>> [1]
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog
>>>> [2] https://lists.apache.org/thread/l1l7f30h7zldjp6ow97y70dcthx7tl37
>> 
>> 



Re: [DISCUSS] FLIP-309: Enable operators to trigger checkpoints dynamically

2023-06-27 Thread Jark Wu
Thank you Dong for driving this FLIP. 

The new design looks good to me!

Best,
Jark

> On 2023-06-27 14:38, Dong Lin  wrote:
> 
> Thank you Leonard for the review!
> 
> Hi Piotr, do you have any comments on the latest proposal?
> 
> I am wondering if it is OK to start the voting thread this week.
> 
> On Mon, Jun 26, 2023 at 4:10 PM Leonard Xu  wrote:
> 
>> Thanks Dong for driving this FLIP forward!
>> 
>> Introducing a `backlog status` concept for Flink jobs makes sense to me
>> for the following reasons:
>>
>> From a concept/API design perspective, it is more general and natural than
>> the above proposals, as it can be used in HybridSource for bounded records,
>> CDC sources for the history snapshot, and general sources like KafkaSource
>> for historical messages.
>> 
>> From user cases/requirements, I've seen many users manually set a larger
>> checkpoint interval during backfilling and then a shorter checkpoint
>> interval for real-time processing in their production environments as a
>> Flink application optimization. Now, the framework can make this
>> optimization no longer require users to set the checkpoint interval and
>> restart the job multiple times.
>> 
>> Following the current FLIP's support for a larger checkpoint interval for
>> jobs under backlog status, we can explore supporting larger
>> parallelism/memory/CPU for jobs under backlog status in the future.
>> 
>> In short, the updated FLIP looks good to me.
>> 
>> 
>> Best,
>> Leonard
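A sketch of how the optimization described above looks with FLIP-309: the
source reports its backlog status, and checkpointing switches between two
intervals. Option and method names follow the FLIP; values are illustrative:

    # flink-conf.yaml
    execution.checkpointing.interval: 30s
    # Applied while a source reports isProcessingBacklog = true, e.g. via
    # SplitEnumeratorContext#setIsProcessingBacklog during the snapshot phase:
    execution.checkpointing.interval-during-backlog: 10min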
>> 
>> 
>>> On Jun 22, 2023, at 12:07 PM, Dong Lin  wrote:
>>> 
>>> Hi Piotr,
>>> 
>>> Thanks again for proposing the isProcessingBacklog concept.
>>> 
>>> After discussing with Becket Qin and thinking about this more, I agree it
>>> is a better idea to add a top-level concept to all source operators to
>>> address the target use-case.
>>> 
>>> The main reason that changed my mind is that isProcessingBacklog can be
>>> described as an inherent/nature attribute of every source instance and
>> its
>>> semantics does not need to depend on any specific checkpointing policy.
>>> Also, we can hardcode the isProcessingBacklog behavior for the sources we
>>> have considered so far (e.g. HybridSource and MySQL CDC source) without
>>> asking users to explicitly configure the per-source behavior, which
>> indeed
>>> provides better user experience.
>>> 
>>> I have updated the FLIP based on the latest suggestions. The latest FLIP
>> no
>>> longer introduces per-source config that can be used by end-users. While
>> I
>>> agree with you that CheckpointTrigger can be a useful feature to address
>>> additional use-cases, I am not sure it is necessary for the use-case
>>> targeted by FLIP-309. Maybe we can introduce CheckpointTrigger separately
>>> in another FLIP?
>>> 
>>> Can you help take another look at the updated FLIP?
>>> 
>>> Best,
>>> Dong
>>> 
>>> 
>>> 
>>> On Fri, Jun 16, 2023 at 11:59 PM Piotr Nowojski 
>>> wrote:
>>> 
>>>> Hi Dong,
>>>>
>>>> > Suppose there are 1000 subtasks and each subtask has a 1% chance of
>>>> > being "backpressured" at a given time (due to random traffic spikes).
>>>> > Then at any given time, the chance of the job
>>>> > being considered not-backpressured = (1-0.01)^1000. Since we evaluate
>>>> > the backpressure metric once a second, the estimated time for the job
>>>> > to be considered not-backpressured is roughly 1 / ((1-0.01)^1000) =
>>>> > 23163 sec = 6.4 hours.
>>>> >
>>>> > This means that the job will effectively always use the longer
>>>> > checkpointing interval. It looks like a real concern, right?
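A quick check of the arithmetic quoted above (a geometric-distribution
expectation, added for the reader; not part of the original thread):

    P(not backpressured at a given second) = (1 - 0.01)^1000
                                           ~ e^(-10.05) ~ 4.3e-5
    E[seconds until first such observation] = 1 / 4.3e-5 ~ 23163 s ~ 6.4 h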
 
>>>> Sorry I don't understand where you are getting those numbers from.
>>>> Instead of trying to find loophole after loophole, could you try to think
>>>> how a given loophole could be improved/solved?
>>>>
>>>> > Hmm... I honestly think it will be useful to know the APIs due to the
>>>> > following reasons.
>>>>
>>>> Please propose something. I don't think it's needed.
>>>>
>>>> > - For the use-case mentioned in FLIP-309 motivation section, would the
>>>> > APIs of this alternative approach be more or less usable?
>>>>
>>>> Everything that you originally wanted to achieve in FLIP-309, you could
>>>> do as well in my proposal.
>>>> Vide my many mentions of the "hacky solution".
>>>>
>>>> > - Can these APIs reliably address the extra use-case (e.g. allow
>>>> > checkpointing interval to change dynamically even during the unbounded
>>>> > phase) as it claims?
>>>>
>>>> I don't see why not.
>>>>
>>>> > - Can these APIs be decoupled from the APIs currently proposed in
>>>> > FLIP-309?
>>>>
>>>> Yes
>>>>
>>>> > For example, if the APIs of this alternative approach can be decoupled
>>>> > from the APIs currently proposed in FLIP-309, then it might be
>>>> > reasonable to work on this extra use-case with a more
>>>> > advanced/complicated design separately in a followup work.
>>>>
>>>> As I voiced my concerns previously, the current design of FLIP-309 would
>>>> clog the public API and in the long run confuse the users. IMO it's
>>>> addressi

[jira] [Created] (FLINK-32444) Enable object reuse for Flink SQL jobs by default

2023-06-26 Thread Jark Wu (Jira)
Jark Wu created FLINK-32444:
---

 Summary: Enable object reuse for Flink SQL jobs by default
 Key: FLINK-32444
 URL: https://issues.apache.org/jira/browse/FLINK-32444
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / API
Reporter: Jark Wu
 Fix For: 1.18.0


Currently, object reuse is not enabled by default for Flink streaming jobs, but
is enabled by default for Flink batch jobs. That is inconsistent with
stream-batch unification. Besides, SQL operators are safe under object reuse,
and enabling it is a great performance improvement for SQL jobs.

We should also be careful with the Table-DataStream conversion case
(StreamTableEnvironment), for which enabling object reuse by default is not
safe. Maybe we can just enable it for SQL Client/Gateway and TableEnvironment.
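For reference, this is how object reuse is enabled explicitly today; the
issue proposes making this the default for SQL jobs:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class EnableObjectReuse {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            // Equivalent to setting pipeline.object-reuse: true.
            env.getConfig().enableObjectReuse();
        }
    }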



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

