Re: Re: [ANNOUNCE] Apache Flink CDC 3.1.0 released

2024-05-19 Thread Jingsong Li
Amazing, congrats!

Best,
Jingsong

On Sat, May 18, 2024 at 3:10 PM 大卫415 <2446566...@qq.com.invalid> wrote:
>
> 退订
>
>
>
>
>
>
>
> Original Email
>
>
>
> Sender:"gongzhongqiang"< gongzhongqi...@apache.org >;
>
> Sent Time:2024/5/17 23:10
>
> To:"Qingsheng Ren"< re...@apache.org >;
>
> Cc recipient:"dev"< d...@flink.apache.org >;"user"< u...@flink.apache.org 
> >;"user-zh"< user-zh@flink.apache.org >;"Apache Announce List"< 
> annou...@apache.org >;
>
> Subject:Re: [ANNOUNCE] Apache Flink CDC 3.1.0 released
>
>
> Congratulations !
> Thanks for all contributors.
>
>
> Best,
>
> Zhongqiang Gong
>
> Qingsheng Ren  于 2024年5月17日周五 17:33写道:
>
> > The Apache Flink community is very happy to announce the release of
> > Apache Flink CDC 3.1.0.
> >
> > Apache Flink CDC is a distributed data integration tool for real time
> > data and batch data, bringing the simplicity and elegance of data
> > integration via YAML to describe the data movement and transformation
> > in a data pipeline.
> >
> > Please check out the release blog post for an overview of the release:
> >
> > 
> https://flink.apache.org/2024/05/17/apache-flink-cdc-3.1.0-release-announcement/
> >
> > The release is available for download at:
> > https://flink.apache.org/downloads.html
> >
> > Maven artifacts for Flink CDC can be found at:
> > https://search.maven.org/search?q=g:org.apache.flink%20cdc
> >
> > The full release notes are available in Jira:
> >
> > 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354387
> >
> > We would like to thank all contributors of the Apache Flink community
> > who made this release possible!
> >
> > Regards,
> > Qingsheng Ren
> >


Re: Re: [ANNOUNCE] Apache Flink CDC 3.1.0 released

2024-05-19 Thread Jingsong Li
CC to the Paimon community.

Best,
Jingsong

On Mon, May 20, 2024 at 9:55 AM Jingsong Li  wrote:
>
> Amazing, congrats!
>
> Best,
> Jingsong
>
> On Sat, May 18, 2024 at 3:10 PM 大卫415 <2446566...@qq.com.invalid> wrote:
> >
> > 退订
> >
> >
> >
> >
> >
> >
> >
> > Original Email
> >
> >
> >
> > Sender:"gongzhongqiang"< gongzhongqi...@apache.org >;
> >
> > Sent Time:2024/5/17 23:10
> >
> > To:"Qingsheng Ren"< re...@apache.org >;
> >
> > Cc recipient:"dev"< d...@flink.apache.org >;"user"< 
> > u...@flink.apache.org >;"user-zh"< user-zh@flink.apache.org >;"Apache 
> > Announce List"< annou...@apache.org >;
> >
> > Subject:Re: [ANNOUNCE] Apache Flink CDC 3.1.0 released
> >
> >
> > Congratulations !
> > Thanks for all contributors.
> >
> >
> > Best,
> >
> > Zhongqiang Gong
> >
> > Qingsheng Ren  于 2024年5月17日周五 17:33写道:
> >
> > > The Apache Flink community is very happy to announce the release of
> > > Apache Flink CDC 3.1.0.
> > >
> > > Apache Flink CDC is a distributed data integration tool for real time
> > > data and batch data, bringing the simplicity and elegance of data
> > > integration via YAML to describe the data movement and transformation
> > > in a data pipeline.
> > >
> > > Please check out the release blog post for an overview of the release:
> > >
> > > 
> > https://flink.apache.org/2024/05/17/apache-flink-cdc-3.1.0-release-announcement/
> > >
> > > The release is available for download at:
> > > https://flink.apache.org/downloads.html
> > >
> > > Maven artifacts for Flink CDC can be found at:
> > > https://search.maven.org/search?q=g:org.apache.flink%20cdc
> > >
> > > The full release notes are available in Jira:
> > >
> > > 
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354387
> > >
> > > We would like to thank all contributors of the Apache Flink community
> > > who made this release possible!
> > >
> > > Regards,
> > > Qingsheng Ren
> > >


Re: flinksql 经过优化后,group by字段少了

2024-05-19 Thread Benchao Li
看起来像是因为 "dt = cast(CURRENT_DATE  as string)" 推导 dt 这个字段是个常量,进而被优化掉了。

将 CURRENT_DATE 优化为常量的行为应该只在 batch 模式下才是这样的,你这个 SQL 是跑在 batch 模式下的嘛?

℡小新的蜡笔不见嘞、 <1515827...@qq.com.invalid> 于2024年5月19日周日 01:01写道:
>
> create view tmp_view as
> SELECT
> dt, -- 2
> uid, -- 0
> uname, -- 1
> uage -- 3
> from
> kafkaTable
> where dt = cast(CURRENT_DATE  as string);
>
> insert into printSinkTable
> select
> dt, uid, uname, sum(uage)
> from tmp_view
> group by
> dt,
> uid,
> uname;
>
>
>
> sql 比较简单,首先根据 dt = current_date 条件进行过滤,然后 按照dt、uid、uname 三个字段进行聚合求和操作。
> 但是,经过优化后,生成的 物理结构如下:
> == Optimized Execution Plan ==
> Sink(table=[default_catalog.default_database.printSinkTable], fields=[dt, 
> uid, uname, EXPR$3])
> +- Calc(select=[CAST(CAST(CURRENT_DATE())) AS dt, uid, uname, EXPR$3])
>    +- GroupAggregate(groupBy=[uid, uname], select=[uid, uname, 
> SUM(uage) AS EXPR$3])
>       +- Exchange(distribution=[hash[uid, uname]])
>          +- Calc(select=[uid, uname, uage], 
> where=[(dt = CAST(CURRENT_DATE()))])
>             +- 
> TableSourceScan(table=[[default_catalog, default_database, kafkaTable]], 
> fields=[uid, uname, dt, uage])
>
>
>
> 请问,这个时候,怎么实现按照 dt\uid\uname 三个字段聚合求和。感谢



-- 

Best,
Benchao Li


Re: flinksql 经过优化后,group by字段少了

2024-05-19 Thread Benchao Li
你引用的这个 calcite 的 issue[1] 是在 calcite-1.22.0 版本中就修复了的,Flink 应该从 1.11
版本开始就已经用的是这个 calcite 版本了。

所以你用的是哪个版本的 Flink 呢,感觉这个可能是另外一个问题。如果可以在当前最新的版本 1.19 中复现这个问题的话,可以建一个
issue 来报一个 bug。

PS: 
上面我说的这个行为,我稍微确认下了,这个应该是一个代码生成阶段才做的区分,所以优化过程中并不能识别,所以即使是batch模式下,优化的plan也应该是包含dt字段的。

[1] https://issues.apache.org/jira/browse/CALCITE-3531

℡小新的蜡笔不见嘞、 <1515827...@qq.com.invalid> 于2024年5月20日周一 11:06写道:
>
> 您好,当前是流任务。我跟了下代码,cast(CURRENT_DATE as string) 被识别了常量。这个问题已经在 calcite 
> 中修复了,https://github.com/apache/calcite/pull/1602/files
> 但是,flink 中引用的 calcite 版本并没有修复这个问题。我这边通过自定义 udf 来规避了这个问题。
>
>
>
>
> -- 原始邮件 --
> 发件人:  
>   "user-zh"   
>  
>  发送时间: 2024年5月20日(星期一) 上午10:32
> 收件人: "user-zh"
> 主题: Re: flinksql 经过优化后,group by字段少了
>
>
>
> 看起来像是因为 "dt = cast(CURRENT_DATE  as string)" 推导 dt 这个字段是个常量,进而被优化掉了。
>
> 将 CURRENT_DATE 优化为常量的行为应该只在 batch 模式下才是这样的,你这个 SQL 是跑在 batch 模式下的嘛?
>
> ℡小新的蜡笔不见嘞、 <1515827...@qq.com.invalid> 于2024年5月19日周日 01:01写道:
> >
> > create view tmp_view as
> > SELECT
> > dt, -- 2
> > uid, -- 0
> > uname, -- 1
> > uage -- 3
> > from
> > kafkaTable
> > where dt = cast(CURRENT_DATE  as string);
> >
> > insert into printSinkTable
> > select
> > dt, uid, uname, sum(uage)
> > from tmp_view
> > group by
> > dt,
> > uid,
> > uname;
> >
> >
> >
> > sql 比较简单,首先根据 dt = current_date 条件进行过滤,然后 按照dt、uid、uname 三个字段进行聚合求和操作。
> > 但是,经过优化后,生成的 物理结构如下:
> > == Optimized Execution Plan ==
> > Sink(table=[default_catalog.default_database.printSinkTable], 
> fields=[dt, uid, uname, EXPR$3])
> > +- Calc(select=[CAST(CAST(CURRENT_DATE())) AS dt, uid, uname, EXPR$3])
> >    +- GroupAggregate(groupBy=[uid, uname], 
> select=[uid, uname, SUM(uage) AS EXPR$3])
> >       +- Exchange(distribution=[hash[uid, 
> uname]])
> >          +- 
> Calc(select=[uid, uname, uage], where=[(dt = CAST(CURRENT_DATE()))])
> >             +- 
> TableSourceScan(table=[[default_catalog, default_database, kafkaTable]], 
> fields=[uid, uname, dt, uage])
> >
> >
> >
> > 请问,这个时候,怎么实现按照 dt\uid\uname 三个字段聚合求和。感谢
>
>
>
> --
>
> Best,
> Benchao Li



-- 

Best,
Benchao Li