Re: Re: Re: [VOTE] Accept Flink CDC into Apache Flink

2024-01-11 Thread godfrey he
+1 (binding)

Thanks,
Godfrey

Zhu Zhu wrote on Fri, Jan 12, 2024 at 14:10:
>
> +1 (binding)
>
> Thanks,
> Zhu
>
> Hangxiang Yu wrote on Thu, Jan 11, 2024 at 14:26:
>
> > +1 (non-binding)
> >
> > On Thu, Jan 11, 2024 at 11:19 AM Xuannan Su  wrote:
> >
> > > +1 (non-binding)
> > >
> > > Best,
> > > Xuannan
> > >
> > > On Thu, Jan 11, 2024 at 10:28 AM Xuyang  wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Best!
> > > > Xuyang
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 2024-01-11 at 10:00:11, "Yang Wang" wrote:
> > > > >+1 (binding)
> > > > >
> > > > >
> > > > >Best,
> > > > >Yang
> > > > >
> > > > >On Thu, Jan 11, 2024 at 9:53 AM liu ron  wrote:
> > > > >
> > > > >> +1 non-binding
> > > > >>
> > > > >> Best
> > > > >> Ron
> > > > >>
> > > > >> Matthias Pohl wrote on Wed, Jan 10, 2024 at 23:05:
> > > > >>
> > > > >> > +1 (binding)
> > > > >> >
> > > > >> > On Wed, Jan 10, 2024 at 3:35 PM ConradJam 
> > > wrote:
> > > > >> >
> > > > >> > > +1 non-binding
> > > > >> > >
> > > > >> > > Dawid Wysakowicz wrote on Wed, Jan 10, 2024 at 21:06:
> > > > >> > >
> > > > >> > > > +1 (binding)
> > > > >> > > > Best,
> > > > >> > > > Dawid
> > > > >> > > >
> > > > >> > > > On Wed, 10 Jan 2024 at 11:54, Piotr Nowojski <
> > > pnowoj...@apache.org>
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > +1 (binding)
> > > > >> > > > >
> > > > >> > > > > On Wed, Jan 10, 2024 at 11:25, Martijn Visser <
> > > martijnvis...@apache.org> wrote:
> > > > >> > > > >
> > > > >> > > > > > +1 (binding)
> > > > >> > > > > >
> > > > >> > > > > > On Wed, Jan 10, 2024 at 4:43 AM Xingbo Huang <
> > > hxbks...@gmail.com
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > +1 (binding)
> > > > >> > > > > > >
> > > > >> > > > > > > Best,
> > > > >> > > > > > > Xingbo
> > > > >> > > > > > >
> > > > >> > > > > > > Dian Fu wrote on Wed, Jan 10, 2024 at 11:35:
> > > > >> > > > > > >
> > > > >> > > > > > > > +1 (binding)
> > > > >> > > > > > > >
> > > > >> > > > > > > > Regards,
> > > > >> > > > > > > > Dian
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Wed, Jan 10, 2024 at 5:09 AM Sharath <
> > > > >> dsaishar...@gmail.com
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > +1 (non-binding)
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Best,
> > > > >> > > > > > > > > Sharath
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Tue, Jan 9, 2024 at 1:02 PM Venkata Sanath
> > > Muppalla <
> > > > >> > > > > > > > sanath...@gmail.com>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > +1 (non-binding)
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Thanks,
> > > > >> > > > > > > > > > Sanath
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Tue, Jan 9, 2024 at 11:16 AM Peter Huang <
> > > > >> > > > > > > > huangzhenqiu0...@gmail.com>
> > > > >> > > > > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > +1 (non-binding)
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Best Regards
> > > > >> > > > > > > > > > > Peter Huang
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Tue, Jan 9, 2024 at 5:26 AM Jane Chan <
> > > > >> > > > > qingyue@gmail.com>
> > > > >> > > > > > > > wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > > +1 (non-binding)
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Best,
> > > > >> > > > > > > > > > > > Jane
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Tue, Jan 9, 2024 at 8:41 PM Lijie Wang <
> > > > >> > > > > > > > wangdachui9...@gmail.com>
> > > > >> > > > > > > > > > > > wrote:
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > +1 (non-binding)
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Best,
> > > > >> > > > > > > > > > > > > Lijie
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > Jiabao Sun  > .invalid> wrote on Tue, Jan 9, 2024 at 19:28:
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > +1 (non-binding)
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > Best,
> > > > >> > > > > > > > > > > > > > Jiabao
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > On 2024/01/09 09:58:04 xiangyu feng wrote:
> > > > >> > > > > > > > > > > > > > > +1 (non-binding)
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > Regards,
> > > > >> > > > > > > > > > > > > > > Xiangyu Feng
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > Danny Cranmer wrote on Tue, Jan 9, 2024 at 17:50:
> > > > >> > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > +1 (binding)
> > > > >> > > > > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > > > > Thanks,
> > > > >> > > > > 

[jira] [Created] (FLINK-31833) Support code-gen fusion for multiple operators

2023-04-18 Thread Godfrey He (Jira)
Godfrey He created FLINK-31833:
--

 Summary: Support code-gen fusion for multiple operators
 Key: FLINK-31833
 URL: https://issues.apache.org/jira/browse/FLINK-31833
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Godfrey He






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31767) Improve the implementation for "analyze table" execution

2023-04-10 Thread Godfrey He (Jira)
Godfrey He created FLINK-31767:
--

 Summary: Improve the implementation for "analyze table" execution
 Key: FLINK-31767
 URL: https://issues.apache.org/jira/browse/FLINK-31767
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: Godfrey He
Assignee: Godfrey He
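
For context, a minimal sketch of the statement whose execution path this ticket
targets (syntax per FLIP-240; the table, partition, and column names are
illustrative):

  -- collect table- and column-level statistics used by the cost-based optimizer
  ANALYZE TABLE Orders COMPUTE STATISTICS FOR ALL COLUMNS;

  -- for a partitioned table, statistics can be computed per partition
  ANALYZE TABLE Sales PARTITION (dt = '2023-04-01') COMPUTE STATISTICS;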






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] FLIP-292: Enhance COMPILED PLAN to support operator-level state TTL configuration

2023-04-10 Thread godfrey he
+1 (binding)

Best,
Godfrey

Jing Ge wrote on Mon, Apr 10, 2023 at 18:42:
>
> +1 (binding)
>
> Best Regards,
> Jing
>
> On Mon, Apr 10, 2023 at 12:27 PM Lincoln Lee  wrote:
>
> > +1 (binding)
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Jane Chan wrote on Mon, Apr 10, 2023 at 18:06:
> >
> > > Hi developers,
> > >
> > > Thanks for all the feedback on FLIP-292: Enhance COMPILED PLAN to support
> > > operator-level state TTL configuration [1].
> > > Based on the discussion [2], we have come to a consensus, so I would like
> > > to start a vote.
> > >
> > > The vote will last for at least 72 hours (Apr. 13th at 10:00 A.M. GMT)
> > > unless there is an objection or insufficient votes.
> > >
> > > [1]
> > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=240883951
> > > [2] https://lists.apache.org/thread/ffmc96gv8ofoskbxlhtm7w8oxv8nqzct
> > >
> > > Best,
> > > Jane Chan
> > >
> >


Re: [External] [DISCUSS] FLIP-292: Support configuring state TTL at operator level for Table API & SQL programs

2023-04-03 Thread godfrey he
Hi Jane,

Thanks for driving this FLIP.

I think the compiled plan solution and the hint solution do not
conflict; the two can exist at the same time.
The compiled plan solution addresses the needs of advanced users and
platform users, for whom the state TTL of every stateful operator must
be user-definable. The hint solution addresses some specific simple
scenarios, and is very user-friendly, convenient, and unambiguous to use.

Some stateful operators are not compiled from SQL directly, such as the
ChangelogNormalize and SinkUpsertMaterializer operators mentioned above.
I notice the example given by Yisha has a hint-propagation problem which
does not conform to the current design. The rough idea behind the hint
solution should be to keep it simple (only the common operators are
supported) and easy to understand (no hint propagation).

If the hint solution is supported, a compiled plan generated from a
query with state TTL hints can also be further modified for the state
TTL parts.
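
A minimal sketch of how the two could combine (COMPILE PLAN / EXECUTE PLAN
are the FLIP-190 statements; STATE_TTL is the hint proposed in this thread,
not an implemented feature; the paths and names are illustrative):

  -- compile the hinted query into an editable JSON plan
  COMPILE PLAN 'file:///tmp/job.json' FOR
  INSERT INTO sink
  SELECT /*+ STATE_TTL('1D') */ id, SUM(num)
  FROM source
  GROUP BY id;

  -- the TTL written by the hint could then be adjusted by hand in job.json
  EXECUTE PLAN 'file:///tmp/job.json';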

So, I prefer to discuss the hint solution in a separate FLIP. I
think that FLIP may need a lot of discussion.

Best,
Godfrey

周伊莎 (Yisha Zhou) wrote on Thu, Mar 30, 2023 at 22:04:
>
> Hi Jane,
>
> Thanks for your detailed response.
>
> You mentioned that there are 10k+ SQL jobs in your production
> > environment, but only ~100 jobs' migration involves plan editing. Is 10k+
> > the number of total jobs, or the number of jobs that use stateful
> > computation and need state migration?
> >
>
> 10k is the number of SQL jobs that enable periodic checkpoints. And
> surely, if users change their SQL in a way that changes the plan, they
> need to do state migration.
>
> - You mentioned that "A truth that can not be ignored is that users
> > usually tend to give up editing TTL(or operator ID in our case) instead of
> > migrating this configuration between their versions of one given job." So
> > what would users prefer to do if they're reluctant to edit the operator
> > ID? Would they submit the same SQL as a new job with a higher version to
> > re-accumulating the state from the earliest offset?
>
>
> You're exactly right. People will tend to re-accumulate the state from a
> given offset by changing the namespace of their checkpoint.
> Namespace is an internal concept, and restarting the SQL job in a new
> namespace can simply be understood as submitting a new job.
>
> Back to your suggestions, I noticed that FLIP-190 [3] proposed the
> > following syntax to perform plan migration
>
>
> The 'plan migration' I mentioned in my last reply may be inaccurate. It's more
> like 'query evolution'. In other words, if a user submitted a SQL job with a
> configured compiled plan, and then
> he changes the SQL, the compiled plan changes too; how do we move the
> configuration in the old plan to the new plan?
> IIUC, FLIP-190 aims to solve issues in Flink version upgrades and leaves out
> 'query evolution', which is a fundamental change to the query, e.g.
> adding a filter condition or a different aggregation.
> And I'm really looking forward to a solution for query evolution.
>
> And I'm also curious about how to use the hint
> > approach to cover cases like
> >
> > - configuring TTL for operators like ChangelogNormalize,
> > SinkUpsertMaterializer, etc., these operators are derived by the planner
> > implicitly
> > - cope with two/multiple input stream operator's state TTL, like join,
> > and other operations like row_number, rank, correlate, etc.
>
>
>  Actually, in our company, a hint affects all operators in the query block
> where the hint is located. For example,
>
> INSERT INTO sink
> > SELECT /*+ STATE_TTL('1D') */
> >id,
> >name,
> >num
> > FROM (
> >SELECT
> >*,
> >ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) as row_num
> >FROM (
> >SELECT
> >*
> >FROM (
> >SELECT
> >id,
> >name,
> >max(num) as num
> >FROM source1
> >GROUP BY
> >id, name, TUMBLE(proc, INTERVAL '1' MINUTE)
> >)
> >GROUP BY
> >id, name, num
> >)
> > )
> > WHERE row_num = 1
> >
>
> In the SQL above, the state TTL of the Rank and the Agg will both be configured
> as 1 day. If users want to set different TTLs for the Rank and the Agg, they can
> just place these two queries in two different query blocks, as sketched below.
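> A minimal sketch of that query-block approach (STATE_TTL is the hint as used
> in our company, not an implemented Flink feature; the durations are
> illustrative):
>
> INSERT INTO sink
> SELECT /*+ STATE_TTL('1D') */      -- this block holds the Rank: state kept 1 day
>     id, name, num
> FROM (
>     SELECT *,
>         ROW_NUMBER() OVER (PARTITION BY id ORDER BY num DESC) AS row_num
>     FROM (
>         SELECT /*+ STATE_TTL('12H') */  -- this block holds the Agg: 12 hours
>             id, name, max(num) AS num
>         FROM source1
>         GROUP BY id, name
>     )
> )
> WHERE row_num = 1
>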
> It looks quite rough but is straightforward enough. For each side of the join
> operator, one of my users proposed a syntax like the one below:
>
> > /*+ JOIN_TTL('tables'='left_table,right_table','left_ttl'='10','right_ttl'='1') */
> >
> > We haven't accepted this proposal yet; maybe we can find a better
> design for this kind of case. Just for your information.
>
> I think if we want to utilize hints to support fine-grained configuration,
> we can open a new FLIP to discuss it.
> BTW, personally, I'm interested in how to design a graphical interface to
> help users maintain their custom fine-grain

[ANNOUNCE] New Apache Flink Committer - Jing Ge

2023-02-13 Thread godfrey he
Hi everyone,

On behalf of the PMC, I'm very happy to announce Jing Ge as a new Flink
committer.

Jing has been consistently contributing to the project for over a year.
He has authored more than 50 PRs and reviewed more than 40 PRs,
with a main focus on the connector, test, and documentation modules.
He was very active on the mailing list (more than 90 threads) last year,
which included participating in many dev discussions (30+),
providing many effective suggestions for FLIPs, and answering
many user questions. He was a keynote speaker at Flink Forward 2022,
helping to promote Flink, and a trainer for the Flink troubleshooting and
performance tuning session of the Flink Forward 2022 training program.

Please join me in congratulating Jing for becoming a Flink committer!

Best,
Godfrey


Re: [VOTE] FLIP-279 Unified the max display column width for SqlClient and Table APi in both Streaming and Batch execMode

2023-01-16 Thread godfrey he
+1 (binding)

Best,
Godfrey

Shammon FY wrote on Thu, Jan 12, 2023 at 23:27:
>
> +1 (no-binding)
>
>
> Best,
> Shammon
>
> On Thu, Jan 12, 2023 at 8:11 PM Shengkai Fang  wrote:
>
> > +1(binding)
> >
> > Best,
> > Shengkai
> >
> > > Jark Wu wrote on Thu, Jan 12, 2023 at 19:22:
> >
> > > +1 (binding)
> > > Thank you for driving this effort.
> > >
> > > Best,
> > > Jark
> > >
> > > > On Jan 9, 2023, at 15:46, Jing Ge wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'd like to start a vote on FLIP-279 Unified the max display column
> > width
> > > > for SqlClient and Table APi in both Streaming and Batch execMode. The
> > > > discussion can be found at [1].
> > > >
> > > > The vote will last for at least 72 hours (Jan 12th at 9:00 GMT) unless
> > > > there is an objection or insufficient votes.
> > > >
> > > > [1] https://lists.apache.org/thread/f9p622k8cgcjl0r0b44np5wm8krhtjjz
> > > > [2]
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-279+Unified+the+max+display+column+width+for+SqlClient+and+Table+APi+in+both+Streaming+and+Batch+execMode
> > > >
> > > > Best regards,
> > > > Jing
> > >
> > >
> >
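
For reference, a minimal sketch of how the unified option would be used in the
SQL Client (option name as proposed in FLIP-279 [2]; previously only the
SQL-Client-scoped option sql-client.display.max-column-width existed):

  SET 'table.display.max-column-width' = '40';
  -- cells wider than 40 characters are truncated in the printed result
  SELECT user_name, long_description FROM Users;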


Re: [VOTE] FLIP-280: Introduce EXPLAIN PLAN_ADVICE to provide SQL advice

2023-01-09 Thread godfrey he
+1 (binding)

Best,
Godfrey

Jingsong Li wrote on Tue, Jan 10, 2023 at 09:56:
>
> +1 (binding)
>
> On Mon, Jan 9, 2023 at 6:19 PM Jane Chan  wrote:
> >
> > Hi all,
> >
> > Thanks for all the feedback so far.
> > Based on the discussion[1], we have come to a consensus, so I would like to
> > start a vote on FLIP-280: Introduce EXPLAIN PLAN_ADVICE to provide SQL
> > advice[2].
> >
> > The vote will last for at least 72 hours (Jan 12th at 10:00 GMT)
> > unless there is an objection or insufficient votes.
> >
> > [1] https://lists.apache.org/thread/5xywxv7g43byoh0jbx1b6qo6gx6wjkcz
> > [2]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-280%3A+Introduce+EXPLAIN+PLAN_ADVICE+to+provide+SQL+advice
> >
> > Best,
> > Jane Chan


Re: [DISCUSS] Adding a option for planner to decide which join reorder rule to choose

2023-01-08 Thread godfrey he
Hi Yunhong,

Thanks for driving this discussion!

This option looks good to me, and I'm looking forward to this rule being
contributed back to Apache Calcite.
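
For reference, a sketch of how the discussed option would be used (the name
and default value are the ones proposed in this thread):

  -- use the bushy join reorder rule only when the number of tables to
  -- reorder does not exceed the threshold; otherwise fall back to the
  -- left-deep LoptOptimizeJoinRule
  SET 'table.optimizer.bushy-join-reorder-threshold' = '12';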

Best,
Godfrey



yh z wrote on Thu, Jan 5, 2023 at 15:32:
>
> Hi Benchao,
>
> Thanks for your reply.
>
> Since our existing test results are based on multiple performance
> optimization points on the TPC-DS benchmark[1][2], we haven't separately
> tested the performance improvement brought by new bushy join reorder
> rule. I will complete this test recently and update the results to this
> email.
>
> I am very happy to contribute to Calcite. Later, I will push the PR of the
> bushy join reorder rule to Calcite.
>
> [1] https://issues.apache.org/jira/browse/FLINK-27583
> [2] https://issues.apache.org/jira/browse/FLINK-29942
>
> Best regards,
> Yunhong Zheng
>
> Benchao Li wrote on Wed, Jan 4, 2023 at 19:03:
>
> > Hi Yunhong,
> >
> > Thanks for the update. And introducing the new bushy join reorder
> > algorithm would be great. And I also agree with the newly added config
> > option "table.optimizer.bushy-join-reorder-threshold" and 12 as the default
> > value.
> >
> >
> > > As for optimization
> > > latency, this is the problem to be solved by the parameters to be
> > > introduced in this discussion. When there are many tables that need to be
> > > reordered, the optimization latency will increase greatly. But when the
> > > number of tables is less than the threshold, the latency is the same as with the
> > > LoptOptimizeJoinRule.
> >
> >
> > This sounds great. If possible, could you share more numbers with us? E.g.,
> > what's the optimization latency when there are 11/12 tables for both
> > approaches?
> >
> >  For question #3: The implementation of Calcite MultiJoinOptimizeBushyRule
> > > is very simple, and it will not store the intermediate results at all.
> > So,
> > > the implementation of Calcite cannot get all possible join reorder
> > results
> > > and it cannot combine with the current cost model to get more reasonable
> > > join reorder results.
> >
> >
> > It's ok to do it in Flink as the first step. It would be great to also
> > contribute it to Calcite later if possible, it depends on you.
> >
> > yh z wrote on Tue, Jan 3, 2023 at 15:27:
> >
> > > Hi Benchao,
> > >
> > > Thanks for your reply.
> > >
> > > Actually, I mistakenly wrote the name "bushy join reorder" as "busy join
> > > reorder". I'm sorry for the trouble this brought you. "Bushy join reorder"
> > > means we can build a bushy join tree based on the cost model, but now Flink can
> > > only build a left-deep tree using Calcite's LoptOptimizeJoinRule. I hope my
> > > answers can help you resolve the following questions:
> > >
> > > For question #1: The biggest advantage of this "bushy join reorder"
> > > strategy over the default Flink left-deep tree strategy is that it can
> > > retain all possible join reorder plans, and then select the optimal plan
> > > according to the cost model. This means that the bushy join reorder strategy
> > > can be better combined with the current cost model to get more reasonable
> > > join reorder results. We verified it on the TPC-DS benchmark: with the
> > > Spark plan as a reference, the new bushy join reorder strategy makes more
> > > TPC-DS query plans get adjusted to be consistent with the Spark plan, and
> > > the execution time is significantly reduced. As for optimization
> > > latency, this is the problem to be solved by the parameters to be
> > > introduced in this discussion. When there are many tables that need to be
> > > reordered, the optimization latency will increase greatly. But when the
> > > number of tables is less than the threshold, the latency is the same as
> > > with the LoptOptimizeJoinRule.
> > >
> > > For question #2: According to my research, many compute or database systems
> > > have "bushy join reorder" strategies based on dynamic programming. For
> > > example, Spark and PostgreSQL use the same strategy, with the threshold
> > > set to 12. Also, some papers, like [1] and [2], have researched this
> > > strategy, and [2] sets the threshold to 14.
> > >
> > > For question #3: The implementation of Calcite MultiJoinOptimizeBushyRule
> > > is very simple, and it will not store the intermediate results at all.
> > So,
> > > the implementation of Calcite cannot get all possible join reorder
> > results
> > > and it cannot combine with the current cost model to get more reasonable
> > > join reorder results.
> > >
> > >
> > > [1]
> > >
> > >
> > https://courses.cs.duke.edu/compsci516/cps216/spring03/papers/selinger-etal-1979.pdf
> > > [2] https://db.in.tum.de/~radke/papers/hugejoins.pdf
> > >
> > >
> > >
> > > Benchao Li wrote on Tue, Jan 3, 2023 at 12:54:
> > >
> > > > Hi Yunhong,
> > > >
> > > > Thanks for driving this~
> > > >
> > > > I haven't gone deep into the implementation details yet. Regarding the
> > > > general description, I would ask a few questions firstly:
> > > >
> > > > #1, Is there any benchmark results about the optimization latency
> > change
> > > > compared to c

Re: [DISCUSS] FLIP-280: Introduce a new explain mode to provide SQL advice

2023-01-02 Thread godfrey he
Thanks for driving this discussion.

Do we really need to expose `PlanAnalyzerFactory` as a public interface?
I prefer that we only expose ExplainDetail#ANALYZED_PHYSICAL_PLAN and the
analyzed result, which is enough for users and consistent with the results
of the `explain` method.
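
For example, usage could then stay a plain EXPLAIN call (a sketch of the
syntax proposed in the FLIP; the table name is illustrative):

  EXPLAIN PLAN_ADVICE
  SELECT id, COUNT(*) FROM Orders GROUP BY id;
  -- prints the optimized physical plan plus the analyzed advice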

The classes related to the plan analyzer are in the table planner module,
which contains no public API
(public interfaces should be defined in the flink-table-api-java module).
And PlanAnalyzer depends on RelNode, which is an internal class of the
planner and is not exposed to users.

Best,
Godfrey


Shengkai Fang wrote on Tue, Jan 3, 2023 at 13:43:
>
> Sorry for the missing answer about the configuration of the Analyzer. Users
> may not need to configure this with SQL statements. In the SQL Gateway,
> users can configure the endpoints with the option `sql-gateway.endpoint.type`
> in the flink-conf.
>
> Best,
> Shengkai
>
> Shengkai Fang wrote on Tue, Jan 3, 2023 at 12:26:
>
> > Hi, Jane.
> >
> > Thanks for bringing this to the discussion. I have some questions about
> > the FLIP:
> >
> > 1. `PlanAnalyzer#analyze` uses the FlinkRelNode as the input. Could you
> > share some thoughts about the motivation? In my experience, users mainly
> > care about 2 things when they develop their job:
> >
> > a. Why does their SQL not work? For example, their streaming SQL contains
> > an OVER window but the ORDER key is not ROWTIME. In this case, we may
> > not have a physical node or logical node because, during the
> > optimization, the planner has already thrown the exception.
> >
> > b. Many users care about whether their state is compatible after upgrading
> > their Flink version. In this case, I think the old execplan and the SQL
> > statement are the user's input.
> >
> > So, I think we should introduce methods like `PlanAnalyzer#analyze(String
> > sql)` and `PlanAnalyzer#analyze(String sql, ExecnodeGraph)` here.
> >
> > 2. I am just curious how other people would add rules to the Advisor. When
> > the number of rules increases, should all of them be added to the Flink codebase?
> > 3. How do users configure another advisor?
> >
> > Best,
> > Shengkai
> >
> >
> >
> > Jane Chan wrote on Wed, Dec 28, 2022 at 12:30:
> >
> >> Hi @yuxia, Thank you for reviewing the FLIP and raising questions.
> >>
> >> 1: Is the PlanAnalyzerFactory also expected to be implemented by users
> >> just
> >> > like DynamicTableSourceFactory or other factories? If so, I notice that
> >> in
> >> > the code of PlanAnalyzerManager#registerAnalyzers, the code is as
> >> follows:
> >> > FactoryUtil.discoverFactory(classLoader, PlanAnalyzerFactory.class,
> >> > StreamPlanAnalyzerFactory.STREAM_IDENTIFIER)); IIUC, it'll always find
> >> the
> >> > factory with the name StreamPlanAnalyzerFactory.STREAM_IDENTIFIER; Is
> >> it a
> >> > typo or by design ?
> >>
> >>
> >> This is a really good open question. For the short answer, yes, it is by
> >> design. I'll explain the consideration in more detail.
> >>
> >> The standard procedure to create a custom table source/sink is to
> >> implement
> >> the factory and the source/sink class. There is a strong 1v1 relationship
> >> between the factory and the source/sink.
> >>
> >>   SQL                                        | DynamicTableSourceFactory       | Source
> >>   create table … with ('connector' = 'foo')  | #factoryIdentifier.equals("foo") | FooTableSource
> >>
> >>
> >> *Apart from that, the custom function module is another kind of
> >> implementation. The factory creates a collection of functions. This is a
> >> 1vN relationship between the factory and the functions.*
> >>
> >>   SQL               | ModuleFactory                    | Function
> >>   load module 'bar' | #factoryIdentifier.equals("bar") | A collection of functions
> >>
> >> Back to the plan analyzers, if we choose the first style, we also need to
> >> expose a new SQL syntax to users, like "CREATE ANALYZER foo WITH ..." to
> >> specify the factory identifier. But I think it is too heavy because an
> >> analyzer is an auxiliary tool to help users write better queries, and thus
> >> it should be exposed at the API level rather than at the user syntax level.
> >>
> >> As a result, I propose to follow the second style. Then we don't need to
> >> introduce new syntax to create analyzers. Let StreamPlanAnalyzerFactory be
> >> the default factory to create analyzers under the streaming mode, and the
> >> custom analyzers will register themselves in StreamPlanAnalyzerFactory.
> >>
> >> @Override
> >> public List<PlanAnalyzer> createAnalyzers() {
> >>     return Arrays.asList(
> >>             FooAnalyzer.INSTANCE,
> >>             BarAnalyzer.INSTANCE,
> >>             ...);
> >> }
> >>
> >>
> >> 2: Is there any special reason to make PlanAdvice a final class? Would it
> >> > be better to make it an interface and provide a default
> >> implementation?
> >> > My concern is that some users may want to have their own implementation of
> >> > PlanAdvice. But this may be overthinking. If you think it won't bring any
> >> > problems, I'm also fine with that.
> >>
> >>
> >> The reason 

Re: [VOTE] FLIP-275: Support Remote SQL Client Based on SQL Gateway

2022-12-20 Thread godfrey he
+1 (binding)

Best,
Godfrey

Hang Ruan wrote on Wed, Dec 21, 2022 at 15:21:
>
> +1 (non-binding)
>
> Best,
> Hang
>
> Paul Lam wrote on Tue, Dec 20, 2022 at 17:36:
>
> > +1 (non-binding)
> >
> > Best,
> > Paul Lam
> >
> > > On Dec 20, 2022, at 11:35, Shengkai Fang wrote:
> > >
> > > +1(binding)
> > >
> > > Best,
> > > Shengkai
> > >
> > >> yu zelin wrote on Wed, Dec 14, 2022 at 20:41:
> > >
> > >> Hi, all,
> > >>
> > >> Thanks for all your feedbacks so far. Through the discussion on this
> > >> thread[1], I think we have came to a consensus, so I’d like to start a
> > >> vote on FLIP-275[2].
> > >>
> > >> The vote will last for at least 72 hours (Dec 19th, 13:00 GMT, excluding
> > >> weekend days) unless there is an objection or insufficient vote.
> > >>
> > >> Best,
> > >> Yu Zelin
> > >>
> > >> [1] https://lists.apache.org/thread/zpx64l0z91b0sz0scv77h0g13ptj4xxo
> > >> [2] https://cwiki.apache.org/confluence/x/T48ODg
> >
> >


Re: [DISCUSS] Release Flink 1.16.1

2022-12-20 Thread godfrey he
Hi Martijn,

Thank you for bringing this up.

Regarding the 3 commits Lincoln mentioned, +1 to picking them into 1.16.1.
AFAIK, several users have encountered this kind of data correctness
problem so far; they are waiting for a fix release as soon as possible.

Best,
Godfrey

ConradJam wrote on Tue, Dec 20, 2022 at 15:08:

> Hi Martijn,
>
> After FLINK-30116 was merged, the Flink Web UI can't show the configuration.
> I checked the data returned by the backend and there is no problem, but
> there is an error in the frontend, as shown in the pictures below. Can
> someone take a look before releasing 1.16.1?
>
> [image: Pasted Graphic.png]
>
> [image: Pasted Graphic 1.png]
>
> Martijn Visser wrote on Fri, Dec 16, 2022 at 02:52:
>
>> Hi everyone,
>>
>> I would like to open a discussion about releasing Flink 1.16.1. We've
>> released Flink 1.16 at the end of October, but we already have 58 fixes
>> listed for 1.16.1, including a blocker [1] on the environment variables
>> and
>> a number of critical issues. Some of the critical issues are related to
>> the
>> bugs on the Sink API, on PyFlink and some correctness issues.
>>
>> There are also a number of open issues with a fixVersion set to 1.16.1, so
>> it would be good to understand what the community thinks of starting a
>> release or if there are some fixes that should be included with 1.16.1.
>>
>> Best regards,
>>
>> Martijn
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-30116
>>
>


Re: [DISCUSS] FLIP-275: Support Remote SQL Client Based on SQL Gateway

2022-12-06 Thread godfrey he
Hi, zelin,

>The CLI will use default print style for the non-query result.
Please make sure the print results of EXPLAIN/DESC/SHOW CREATE TABLE
commands are clear.

> We think it’s better to add the root cause to the ErrorResponseBody.
LGTM

Best,
Godfrey

yu zelin wrote on Tue, Dec 6, 2022 at 17:51:
>
> Hi, Godfrey
>
> Thanks for your feedback. Below are my thoughts on your questions.
>
> 1. About RowFormat
> I agree with your opinion, so we decided to revert the RowFormat-related changes
> and let the client resolve the print format.
>
> 2. About ContentType
> I agree that the definition of the ContentType is not clear. But how to define
> the statement type is another big question. So, we decided to only tell query
> results and non-query results apart. The CLI will use the default print style
> for the non-query results.
>
> 3. About ErrorHandling
> I think reusing the current ErrorResponseBody is good, but parsing the root
> cause from the exception stack strings is quite hacky. We think it's better to
> add the root cause to the ErrorResponseBody.
>
> 4. About Runtime REST API Modifications
> I agree, too. This part has been moved to 'Future Work'.
>
> Best,
> Yu Zelin
>
>
> > On Dec 5, 2022, at 18:33, godfrey he wrote:
> >
> > Hi Zelin,
> >
> > Thanks for driving this discussion.
> >
> > I have a few comments,
> >
> >> Add RowFormat to ResultSet to indicate the format of rows.
> > We should not require the SqlGateway server to meet the display
> > requirements of a CliClient, because different CliClients may have
> > different display styles. The server just needs to return the data,
> > and the CliClient prints the result as needed. So RowFormat is not needed.
> >
> >> Add ContentType to ResultSet to indicate what kind of data the result 
> >> contains.
> > At first sight, the values of ContentType overlap. For example, a select
> > query will return QUERY_RESULT, but it also has a JOB_ID. OTHER is too
> > ambiguous; I don't know which kind of query will return OTHER.
> > I recommend returning the concrete type for each statement, such as
> > "CREATE TABLE" for "create table xx (...) with ()",
> > "SELECT" for "select * from xxx". The statement type can be maintained
> > in `Operation`s.
> >
> >> Error Handling
> > I think the current design of the error handling mechanism can meet the
> > requirements of the CliClient; we can get the root cause from
> > the stack (see ErrorResponseBody#errors). If it becomes a common
> > requirement (for many clients) in the future,
> > we can introduce this interface.
> >
> >> Runtime REST API Modification for Local Client Migration
> > I think this part is over-engineered; it belongs to optimization.
> > The client does not require very high performance, and the current design
> > can already meet our needs.
> > If we find performance problems in the future, we can do such optimizations.
> >
> > Best,
> > Godfrey
> >
> yu zelin wrote on Mon, Dec 5, 2022 at 11:11:
> >>
> >> Hi, Shammon
> >>
> Thanks for your feedback. I think it's good to support the jdbc-sdk. However,
> it's not supported on the gateway side yet. In my opinion, this FLIP is more
> concerned with the SQL Client. How about putting "supporting jdbc-sdk" in
> 'Future Work'? We can discuss how to implement it in another thread.
> >>
> >> Best,
> >> Yu Zelin
> >>> On Dec 2, 2022, at 18:12, Shammon FY wrote:
> >>>
> >>> Hi zelin
> >>>
> >>> Thanks for driving this discussion.
> >>>
> >>> I notice that the sql-client will interact with sql-gateway by `REST
> >>> Client` in the `Executor` in the FLIP, how about introducing jdbc-sdk for
> >>> sql-gateway?
> >>>
> >>> Then the sql-client can connect the gateway with jdbc-sdk, on the other
> >>> hand, the other applications and tools such as jmeter can use the jdbc-sdk
> >>> to connect sql-gateway too.
> >>>
> >>> Best,
> >>> Shammon
> >>>
> >>>
> >>> On Fri, Dec 2, 2022 at 4:10 PM yu zelin  wrote:
> >>>
> >>>> Hi Jim,
> >>>>
> >>>> Thanks for your feedback!
> >>>>
> >>>>> Should this configuration be mentioned in the FLIP?
> >>>>
> >>>> Sure.
> >>>>
> >>>>> some way for the server to be able to

Re: [DISCUSS] FLIP-275: Support Remote SQL Client Based on SQL Gateway

2022-12-05 Thread godfrey he
Hi Zelin,

Thanks for driving this discussion.

I have a few comments,

> Add RowFormat to ResultSet to indicate the format of rows.
We should not require the SqlGateway server to meet the display
requirements of a CliClient, because different CliClients may have
different display styles. The server just needs to return the data,
and the CliClient prints the result as needed. So RowFormat is not needed.

> Add ContentType to ResultSet to indicate what kind of data the result 
> contains.
At first sight, the values of ContentType overlap. For example, a select
query will return QUERY_RESULT, but it also has a JOB_ID. OTHER is too
ambiguous; I don't know which kind of query will return OTHER.
I recommend returning the concrete type for each statement, such as
"CREATE TABLE" for "create table xx (...) with ()",
"SELECT" for "select * from xxx". The statement type can be maintained
in `Operation`s.

>Error Handling
I think the current design of the error handling mechanism can meet the
requirements of the CliClient; we can get the root cause from
the stack (see ErrorResponseBody#errors). If it becomes a common
requirement (for many clients) in the future,
we can introduce this interface.

>Runtime REST API Modification for Local Client Migration
I think this part is over-engineered; it belongs to optimization.
The client does not require very high performance, and the current design
can already meet our needs.
If we find performance problems in the future, we can do such optimizations.

Best,
Godfrey

yu zelin wrote on Mon, Dec 5, 2022 at 11:11:
>
> Hi, Shammon
>
> Thanks for your feedback. I think it’s good to support jdbc-sdk. However,
> it's not supported in the gateway side yet. In my opinion, this FLIP is more
> concerned with the SQL Client. How about put “supporting jdbc-sdk” in
> ‘Future Work’? We can discuss how to implement it in another thread.
>
> Best,
> Yu Zelin
> > On Dec 2, 2022, at 18:12, Shammon FY wrote:
> >
> > Hi zelin
> >
> > Thanks for driving this discussion.
> >
> > I notice that the sql-client will interact with sql-gateway by `REST
> > Client` in the `Executor` in the FLIP, how about introducing jdbc-sdk for
> > sql-gateway?
> >
> > Then the sql-client can connect the gateway with jdbc-sdk, on the other
> > hand, the other applications and tools such as jmeter can use the jdbc-sdk
> > to connect sql-gateway too.
> >
> > Best,
> > Shammon
> >
> >
> > On Fri, Dec 2, 2022 at 4:10 PM yu zelin  wrote:
> >
> >> Hi Jim,
> >>
> >> Thanks for your feedback!
> >>
> >>> Should this configuration be mentioned in the FLIP?
> >>
> >> Sure.
> >>
> >>> some way for the server to be able to limit the number of requests it
> >> receives.
> >> I'm sorry, but this FLIP is dedicated to implementing the Remote mode, so we
> >> didn't consider this much. I think the option is enough currently. I will add
> >> the improvement suggestions to the 'Future Work' section.
> >>
> >>> I wonder if two other options are possible
> >>
> >> Forwarding the raw format to the gateway and then to the client is possible.
> >> The raw results from the sink are in 'CollectResultIterator#bufferedResult'.
> >> First, we can find a way to get this result without wrapping it. Second, we
> >> can construct an 'InternalTypeInfo' using the schema information (the data's
> >> logical type). After construction, we can get the 'TypeSerializer' to
> >> deserialize the raw result.
> >>
> >>
> >>
> >>
> >>> On Dec 1, 2022, at 04:54, Jim Hughes wrote:
> >>>
> >>> Hi Yu,
> >>>
> >>> Thanks for moving my comments to this thread!  Also, thank you for
> >>> answering my questions; it is helping me understand the SQL Gateway
> >>> better.
> >>>
> >>> 5.
>  Our idea is to introduce a new session option (like
> >>> 'sql-client.result.fetch-interval') to control
> >>> the fetching requests sending frequency. What do you think?
> >>>
> >>> Should this configuration be mentioned in the FLIP?
> >>>
> >>> One slight concern I have with having 'sql-client.result.fetch-interval'
> >> as
> >>> a session configuration is that users could set it low and cause the
> >> client
> >>> to send a large volume of requests to the SQL gateway.
> >>>
> >>> Generally, I'd like to see some way for the server to be able to limit
> >> the
> >>> number of requests it receives.  If that really needs to be done by a
> >> proxy
> >>> in front of the SQL gateway, that is fine as well.  (To be clear, I don't
> >>> think my concern here should be blocking in any way.)
> >>>
> >>> 7.
>  What is the serialization lifecycle for results?
> >>>
> >>> I wonder if two other options are possible:
> >>> 3) Could the Gateway just forward the result byte array?  (Or does the
> >>> Gateway need to deserialize the response in order to understand it for
> >> some
> >>> reason?)
> >>> 4) Could the JobManager prepare the results in JSON?  (Or similarly could
> >>> the Client read the format which the JobManager sends?)
> >>>
> >>> Thanks again!
> >>>
> >>> Cheers,
> >>>
> >>> Jim
> >>>
> >>> On Wed, Nov 

Re: [ANNOUNCE] New Apache Flink Committer - Matyas Orhidi

2022-11-21 Thread godfrey he
Congratulations, Matyas!

Matthias Pohl wrote on Tue, Nov 22, 2022 at 13:40:
>
> Congratulations, Matyas :)
>
> On Tue, Nov 22, 2022 at 11:44 AM Xingbo Huang  wrote:
>
> > Congrats Matyas!
> >
> > Best,
> > Xingbo
> >
> > Yanfei Lei wrote on Tue, Nov 22, 2022 at 11:18:
> >
> > > Congrats Matyas! 🍻
> > >
> > > Zheng Yu Chen wrote on Tue, Nov 22, 2022 at 11:15:
> > >
> > > > Congratulations ~ 🍻
> > > >
> > > > Márton Balassi wrote on Mon, Nov 21, 2022 at 22:18:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > On behalf of the PMC, I'm very happy to announce Matyas Orhidi as a
> > new
> > > > > Flink
> > > > > committer.
> > > > >
> > > > > Matyas has over a decade of experience in the Big Data ecosystem and
> > > has
> > > > > been working with Flink full time for the past 3 years. In the open
> > > > source
> > > > > community he is one of the key driving members of the Kubernetes
> > > Operator
> > > > > subproject. He implemented multiple key features in the operator
> > > > including
> > > > > the metrics system and the ability to dynamically configure watched
> > > > > namespaces. He enjoys spreading the word about Flink and regularly
> > does
> > > > so
> > > > > via authoring blogposts and giving talks or interviews representing
> > the
> > > > > community.
> > > > >
> > > > > Please join me in congratulating Matyas for becoming a Flink
> > committer!
> > > > >
> > > > > Best,
> > > > > Marton
> > > > >
> > > >
> > > >
> > > > --
> > > > Best
> > > >
> > > > ConradJam
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Yanfei
> > >
> >


[jira] [Created] (FLINK-29981) Improve WatermarkAssignerChangelogNormalizeTransposeRule

2022-11-10 Thread godfrey he (Jira)
godfrey he created FLINK-29981:
--

 Summary: Improve WatermarkAssignerChangelogNormalizeTransposeRule
 Key: FLINK-29981
 URL: https://issues.apache.org/jira/browse/FLINK-29981
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: godfrey he


 

WatermarkAssignerChangelogNormalizeTransposeRule is too complex to maintain.
It would be better to improve it, for example by splitting
WatermarkAssignerChangelogNormalizeTransposeRule into two rules.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Re: [VOTE] FLIP-263: Improve resolving schema compatibility

2022-10-27 Thread godfrey he
+1 (binding)

Thanks for driving this!

Best,
Godfrey

Yun Gao wrote on Fri, Oct 28, 2022 at 13:50:
>
> +1 (binding)
>
> Thanks Hangxiang for driving the FLIP.
>
> Best,
> Yun Gao
>
>
>
>
>  --Original Mail --
> Sender:Zakelly Lan 
> Send Date:Fri Oct 28 12:27:01 2022
> Recipients:Flink Dev 
> Subject:Re: [VOTE] FLIP-263: Improve resolving schema compatibility
> Hi Hangxiang,
>
> The current plan looks good to me, +1 (non-binding). Thanks for driving this.
>
> Best,
> Zakelly
>
> On Fri, Oct 28, 2022 at 11:18 AM Yuan Mei  wrote:
> >
> > +1 (binding)
> >
> > Thanks for driving this.
> >
> > Best
> > Yuan
> >
> > On Fri, Oct 28, 2022 at 11:17 AM yanfei lei  wrote:
> >
> > > +1(non-binding) and thanks for Hangxiang's driving.
> > >
> > >
> > >
> > > Hangxiang Yu wrote on Fri, Oct 28, 2022 at 09:24:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to start the vote for FLIP-263 [1].
> > > >
> > > > Thanks for your feedback and the discussion in [2][3].
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > Best regards,
> > > > Hangxiang.
> > > >
> > > > [1]
> > > >
> > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-263%3A+Improve+resolving+schema+compatibility
> > > > [2] https://lists.apache.org/thread/4w36oof8dh28b9f593sgtk21o8qh8qx4
> > > >
> > > > [3] https://lists.apache.org/thread/t0bdkx1161rlbnsf06x0kswb05mch164
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Yanfei
> > >


Re: [VOTE] Release 1.16.0, release candidate #2

2022-10-20 Thread godfrey he
+1 to getting 1.16.0 released as soon as possible;
it's been more than two months since the feature freeze,
and 1.17 is already kicking off. We can fix the critical bugs in 1.16.1.

Best,
Godfrey

Xintong Song wrote on Fri, Oct 21, 2022 at 09:57:
>
> BTW, missing 1.16.0 is probably not that bad. From my experience, the x.y.0
> releases are usually considered unstable and are mainly for trying out
> purposes. Most users do not upgrade to the new version in production until
> the x.y.1/2 releases, which are considered more stable. As this bug-fix is
> making 1.16.1 anyway, it barely misses anything.
>
> Best,
>
> Xintong
>
>
>
> On Fri, Oct 21, 2022 at 5:10 AM Danny Cranmer 
> wrote:
>
> > We had a similar situation for Flink 1.15.1 where a non-regression
> > "critical" bug was impacting a connector [1]. We decided to not block the
> > release to address this issue. Based on this, I am inclined to agree with
> > Martijn and move forward with the release. This bug is not marked as a
> > "blocker" and we should respect that.
> >
> > I am happy to manage a 1.15.3 or 1.16.1 release follow up to address these
> > concerns.
> >
> > [1] https://lists.apache.org/thread/8hxz5n6x2v4c6v8z33hdz51h1bhn768d
> >
> > Thanks,
> > Danny
> >
> > On Thu, Oct 20, 2022 at 5:01 PM Martijn Visser 
> > wrote:
> >
> > > Taking in this fix would require us to cancel RC2 and create another
> > > release candidate. We are already long-overdue on the Flink 1.16 release.
> > > Given that 1.15.3 is not yet released, it can't be a regression compared
> > to
> > > the current situation of 1.15.2. The Flink Delta connector is not part of
> > > the ASF Flink community so it can't be considered a blocker from a ASF
> > > Flink community perspective.
> > >
> > > I do understand the pain. I'm curious what others think if this is worthy
> > > of cancelling the release candidate.
> > >
> > > Thanks, Martijn
> > >
> > > On Thu, Oct 20, 2022 at 4:54 PM Krzysztof Chmielewski <
> > > krzysiek.chmielew...@gmail.com> wrote:
> > >
> > > > Thank you all for response,
> > > > however I think you may be missing a bigger context regarding those 3
> > tickets.
> > > >
> > > > Those 3 tickets [29509, 29512, 29627] are part of a bigger thing. They are
> > > > fixing a 1.15 Sink V2 issue, where the Task Manager will not start after
> > > > recovery for a Sink topology with a Global Committer. The problem was described by
> > me
> > > in
> > > > this thread [1]. We need all three to fix the problem.
> > > >
> > > > All three tickets were merged into 1.15 release branch and will be
> > > included
> > > > in 1.15.3 probably, however 1.16 will be missing one fix (29627).
> > > > In other words, there will be a regression between 1.16 and 1.15.3.
> > > >
> > > > Additionally for now this issue is blocking Flink migration for Delta
> > > > connector [2].
> > > > We need to migrate because Flink 1.14 has another Sink problem with
> > data
> > > > loss during Sink Recovery with Global Committer [3] and this one most
> > > > likely will not be fixed since 1.14 support is ending.
> > > >
> > > > Forgive me if I'm wrong, but what do you mean by " we won't block 1.16.0
> > > on
> > > > this." Fix is merged so couldn't we just cherry pick 1.16 merge commit
> > to
> > > > 1.16.0's RC2?
> > > >
> > > > [1] https://lists.apache.org/thread/otscy199g1l9t3llvo8s2slntyn2r1jc
> > > > [2] https://github.com/delta-io/connectors/tree/master/flink
> > > > [3] https://issues.apache.org/jira/browse/FLINK-29589
> > > >
> > > > Regards,
> > > > Krzysztof Chmielewski
> > > >
> > > > > On Thu, Oct 20, 2022 at 16:13, Xingbo Huang wrote:
> > > >
> > > > > Hi Krzysztof,
> > > > >
> > > > > When I was building rc2, I tried to search whether issues with `fix
> > > > > version` of 1.16.0 have not been closed.
> > > > > https://issues.apache.org/jira/browse/FLINK-29627 was missed because
> > > the
> > > > > `fix version` was not marked. I agree with Martijn and Xintong that
> > we
> > > > > won't block 1.16.0 on this.
> > > > >
> > > > > Best,
> > > > > Xingbo
> > > > >
> > > > > > Xintong Song wrote on Thu, Oct 20, 2022 at 18:23:
> > > > >
> > > > > > Hi Krzysztof,
> > > > > >
> > > > > > FLINK-29627 is merged after rc2 being created, that's why it
> > doesn't
> > > > > appear
> > > > > > in the change list. See the commit history of rc2 [1].
> > > > > >
> > > > > > It's unfortunate this fix didn't make the 1.16.0 release (if rc2 is
> > > > > > approved). However, I agree with Martijn that we should not further
> > > > block
> > > > > > 1.16.0 on this. If there's no other blockers discovered in the rc2,
> > > the
> > > > > fix
> > > > > > of FLINK-29627 will be shipped in 1.16.1.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Xintong
> > > > > >
> > > > > >
> > > > > > [1] https://github.com/apache/flink/commits/release-1.16.0-rc2
> > > > > >
> > > > > > On Thu, Oct 20, 2022 at 6:16 PM Krzysztof Chmielewski <
> > > > > > krzysiek.chmielew...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks  Martijn,
> > > > > > > just to clarify

Re: [DISCUSS] FLIP-263: Improve resolving schema compatibility

2022-10-20 Thread godfrey he
Hi Hangxiang, Dawid,

I also prefer adding the method to TypeSerializerSnapshot, which looks
more natural. TypeSerializerSnapshot has a `Version` concept, which can
also be used for compatibility.
`
TypeSerializerSnapshot {

TypeSerializerSchemaCompatibility resolveSchemaCompatibility(
 TypeSerializerSnapshot oldSnapshot);

}
`

Best,
Godfrey

Dawid Wysakowicz wrote on Thu, Oct 20, 2022 at 16:59:
>
> That's the final status we will arrive at.
>
> IIUC, we cannot just remove the original method but just mark it as 
> deprecated so two methods have to exist together in the short term.
>
> Users may also need to migrate their own external serializers, which will take
> a long time.
>
> I'd like users' jobs to keep working without modifying any code until the
> deprecated method is removed, which is why I provide a default
> implementation in the proposal.
>
>
> I understand it works for old serializers, but it makes writing new 
> serializers more complicated. It's not as simple as extending an interface 
> and implementing all the methods it tells you to. You have to dig into 
> documentation and check that you must implement one of the methods (either 
> the old or the new one). Otherwise you end up in an infinite loop.
>
>
> If we break the compatibility, yes we cause some annoyance, but I feel it's a 
> safer option. BTW, there is a proposal around serializer that also causes 
> some incompatibilities so we could squash this together: 
> https://lists.apache.org/thread/v1q28zg5jhxcqrpq67pyv291nznd3n0w
>
>
> I don't want to force that, but would be really happy to hear what others 
> think.
>
>
> Best,
>
> Dawid
>
> On 19/10/2022 04:05, Hangxiang Yu wrote:
>
> Hi, Dawid.
>
>
> Thanks for your reply.
>
> Should we introduce the new method to the TypeSerializerSnapshot instead?
>
> You provided a valuable option; let me try to list the benefits and costs.
>
> benefit:
>
> TypeSerializerSnapshot still owns the responsibility of resolving schema 
> compatibility so TypeSerializer could just pay attention to its serialization 
> and deserialization as before.
> It's convenient to implement based on the current implementation, since all the
> information and tooling live in TypeSerializerSnapshot, which also helps
> users migrate their external serializers.
>
> cost:
>
> The interface is a bit strange because we have to construct and use the 
> snapshot of the new serializer to check the compatibility.
>
> I appreciate this because users could migrate more conveniently.
>
> For its cost, We could see TypeSerializerSnapshot as a point-in-time view of 
> TypeSerializer used for forward compatibility checks and persisting 
> serializer within checkpoints. It seems reasonable to generate a 
> point-in-time view of the serializer when we need to resolve compatibility.
>
>
> I'd actually be fine with breaking some external serializers and not provide 
> a default implementation of the new method.
>
> That's the final status we will arrive at.
>
> IIUC, we cannot just remove the original method but just mark it as 
> deprecated so two methods have to exist together in the short term.
>
> Users may also need to migrate their own external serializers, which will take
> a long time.
>
> I'd like users' jobs to keep working without modifying any code until the
> deprecated method is removed, which is why I provide a default
> implementation in the proposal.
>
>
> On Mon, Oct 17, 2022 at 10:10 PM Dawid Wysakowicz  
> wrote:
>>
>> Hi Han,
>>
>> I think in principle your proposal makes sense and the compatibility
>> check indeed should be done in the opposite direction.
>>
>> However, I have two suggestions:
>>
>> 1. Should we introduce the new method to the TypeSerializerSnapshot
>> instead? E.g.
>>
>> TypeSerializerSnapshot {
>>
>> TypeSerializerSchemaCompatibility resolveSchemaCompatibility(
>>  TypeSerializerSnapshot oldSnapshot);
>>
>> }
>>
>> I know it has the downside that we'd need to create a snapshot for the
>> new serializer during a restore, but the there is a lot of tooling and
>> subclasses in the *Snapshot stack that it won't be possible to migrate
>> to TypeSerializer stack. I have classes such as
>> CompositeTypeSerializerSnapshot and alike in mind.
>>
>> 2. I'd actually be fine with breaking some external serializers and not
>> provide a default implementation of the new method. If we don't do that
>> we will have two methods with default implementation which call each
>> other. It makes it also a bit harder to figure out which methods are
>> mandatory to be implemented going forward.
>>
>> What do you think?
>>
>> Best,
>>
>> Dawid
>>
>> On 17/10/2022 04:43, Hangxiang Yu wrote:
>> > Hi, Han.
>> >
>> > Both the old method and the new method can get previous and new inner
>> > information.
>> >
>> > The new serializer will decide it just like the old serializer did before.
>> >
>> > The method just specifies the schema compatibility result, so the other
>> > behaviours are the same as b

[jira] [Created] (FLINK-28939) Release Testing: Verify FLIP-241 ANALYZE TABLE

2022-08-11 Thread godfrey he (Jira)
godfrey he created FLINK-28939:
--

 Summary: Release Testing: Verify FLIP-241 ANALYZE TABLE
 Key: FLINK-28939
 URL: https://issues.apache.org/jira/browse/FLINK-28939
 Project: Flink
  Issue Type: Sub-task
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28940) Release Testing: Verify FLIP-248 Dynamic Partition Pruning

2022-08-11 Thread godfrey he (Jira)
godfrey he created FLINK-28940:
--

 Summary: Release Testing: Verify FLIP-248 Dynamic Partition Pruning
 Key: FLINK-28940
 URL: https://issues.apache.org/jira/browse/FLINK-28940
 Project: Flink
  Issue Type: Sub-task
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[ANNOUNCE] Flink 1.16 Feature Freeze

2022-08-09 Thread godfrey he
Hi everyone,

The deadline for merging new features for Flink 1.16 has passed.

* From now on, only bug-fixes and documentation fixes / improvements are
allowed to be merged into the master branch.

* New features merged after this point can be reverted. If you need an
exception to this rule, please open a discussion on dev@ list and reach out
to us.

We plan to wait for the master branch to get a bit more stabilized before
cutting the "release-1.16" branch, in order to reduce the overhead of
having to manage two branches. That also means potentially delaying merging
new features for Flink 1.17 into the master branch. If you are blocked on
this, please let us know and we can come up with a compromise for the
branch cutting time.

What you can do to help with the release testing phase:

* The first release testing sync will be on *the 16th of August at 9am
CEST / 3pm China Standard Time*.
Everyone is welcome to join. The link can be found on the release wiki page
[1].

* Please prepare for the release testing by creating Jira tickets for
documentation and testing tasks for the new features under the
umbrella issue[2].
Tickets should be opened with Priority Blocker, FixVersion 1.16.0 and Label
release-testing (testing tasks only). It is greatly appreciated if
you can help to verify the new features.

* There are currently 55 test-stability issues affecting the 1.16.0 release
[3]. It is also greatly appreciated if you can help address some of them.

Best,
Martijn, Chesnay, Xingbo & Godfrey

[1] https://cwiki.apache.org/confluence/display/FLINK/1.16+Release
[2] https://issues.apache.org/jira/browse/FLINK-28896
[3] https://issues.apache.org/jira/issues/?filter=12352149


[jira] [Created] (FLINK-28866) Use DDL instead of legacy method to register the test source in JoinITCase

2022-08-08 Thread godfrey he (Jira)
godfrey he created FLINK-28866:
--

 Summary: Use DDL instead of legacy method to register the test 
source in JoinITCase
 Key: FLINK-28866
 URL: https://issues.apache.org/jira/browse/FLINK-28866
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: godfrey he
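
For illustration, a hedged sketch of DDL-based registration (assuming the
planner's `values` testing connector; the table name and schema are
illustrative):

  CREATE TABLE MyTestSource (
    a INT,
    b BIGINT,
    c VARCHAR
  ) WITH (
    'connector' = 'values',
    'bounded' = 'true'
  );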






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28859) FAILED pyflink/datastream/connectors/tests/test_file_system.py::FileSinkCsvBulkWriterTests::test_csv_customize_quote_char_write

2022-08-07 Thread godfrey he (Jira)
godfrey he created FLINK-28859:
--

 Summary: FAILED 
pyflink/datastream/connectors/tests/test_file_system.py::FileSinkCsvBulkWriterTests::test_csv_customize_quote_char_write
 Key: FLINK-28859
 URL: https://issues.apache.org/jira/browse/FLINK-28859
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.16.0
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28858) Add document to describe join hints for batch sql

2022-08-07 Thread godfrey he (Jira)
godfrey he created FLINK-28858:
--

 Summary: Add document to describe join hints for batch sql
 Key: FLINK-28858
 URL: https://issues.apache.org/jira/browse/FLINK-28858
 Project: Flink
  Issue Type: Sub-task
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28821) Adjust join cost for dpp query pattern which could help more plans use dpp

2022-08-04 Thread godfrey he (Jira)
godfrey he created FLINK-28821:
--

 Summary: Adjust join cost for dpp query pattern which could help 
more plans use dpp
 Key: FLINK-28821
 URL: https://issues.apache.org/jira/browse/FLINK-28821
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28787) rename getUniqueKeys to getUpsertKeys in CommonPhysicalJoin

2022-08-03 Thread godfrey he (Jira)
godfrey he created FLINK-28787:
--

 Summary: rename getUniqueKeys to getUpsertKeys in 
CommonPhysicalJoin
 Key: FLINK-28787
 URL: https://issues.apache.org/jira/browse/FLINK-28787
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28753) Improve FilterIntoJoinRule which could push some predicate to another side

2022-07-30 Thread godfrey he (Jira)
godfrey he created FLINK-28753:
--

 Summary: Improve FilterIntoJoinRule which could push some 
predicate to another side
 Key: FLINK-28753
 URL: https://issues.apache.org/jira/browse/FLINK-28753
 Project: Flink
  Issue Type: Improvement
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0
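
For illustration, a hedged sketch of the kind of derivation this improvement
targets (table and column names are made up):

  -- the filter references only side a ...
  SELECT * FROM a JOIN b ON a.id = b.id WHERE a.id > 10;
  -- ... but the join equivalence a.id = b.id allows deriving b.id > 10,
  -- so the predicate can be pushed into side b as well.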






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[RESULT][VOTE] FLIP-248: Introduce dynamic partition pruning

2022-07-29 Thread godfrey he
Hi, everyone.

FLIP-248: Introduce dynamic partition pruning[1] has been accepted.

There are 4 binding votes and 1 non-binding vote [2]:
- Jing Ge (non-binding)
- Yun Gao (binding)
- Jark Wu (binding)
- Jingsong Li (binding)
- Jing Zhang (binding)

None against.

Thanks again to everyone who engaged with this FLIP.


[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
[2] https://lists.apache.org/thread/tr3jdwwv7p45f53q1xj85o2xcbr6fbcb


Best,
Godfrey


[jira] [Created] (FLINK-28726) CompileException: BoundedOverAggregateHelper must implement method AggsHandeFunction.setWindowSize(int)

2022-07-27 Thread godfrey he (Jira)
godfrey he created FLINK-28726:
--

 Summary: CompileException: BoundedOverAggregateHelper must 
implement method AggsHandeFunction.setWindowSize(int)
 Key: FLINK-28726
 URL: https://issues.apache.org/jira/browse/FLINK-28726
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Runtime
Affects Versions: 1.16.0
Reporter: godfrey he
 Fix For: 1.16.0
 Attachments: image-2022-07-28-14-52-08-729.png

 !image-2022-07-28-14-52-08-729.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28711) Hive connector implements SupportsDynamicFiltering interface

2022-07-27 Thread godfrey he (Jira)
godfrey he created FLINK-28711:
--

 Summary: Hive connector implements SupportsDynamicFiltering 
interface
 Key: FLINK-28711
 URL: https://issues.apache.org/jira/browse/FLINK-28711
 Project: Flink
  Issue Type: Sub-task
  Components: Connectors / Hive
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28710) Transform dpp ExecNode to StreamGraph

2022-07-27 Thread godfrey he (Jira)
godfrey he created FLINK-28710:
--

 Summary: Transform dpp ExecNode to StreamGraph
 Key: FLINK-28710
 URL: https://issues.apache.org/jira/browse/FLINK-28710
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28709) Implement dynamic filtering operators

2022-07-27 Thread godfrey he (Jira)
godfrey he created FLINK-28709:
--

 Summary: Implement dynamic filtering operators
 Key: FLINK-28709
 URL: https://issues.apache.org/jira/browse/FLINK-28709
 Project: Flink
  Issue Type: Sub-task
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28708) Introduce planner rules to optimize dpp pattern

2022-07-27 Thread godfrey he (Jira)
godfrey he created FLINK-28708:
--

 Summary: Introduce planner rules to optimize dpp pattern
 Key: FLINK-28708
 URL: https://issues.apache.org/jira/browse/FLINK-28708
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28707) Introduce interface SupportsDynamicPartitionPruning

2022-07-27 Thread godfrey he (Jira)
godfrey he created FLINK-28707:
--

 Summary: Introduce interface SupportsDynamicPartitionPruning
 Key: FLINK-28707
 URL: https://issues.apache.org/jira/browse/FLINK-28707
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Affects Versions: 1.16.0
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28706) FLIP-248: Introduce dynamic partition pruning

2022-07-27 Thread godfrey he (Jira)
godfrey he created FLINK-28706:
--

 Summary: FLIP-248: Introduce dynamic partition pruning
 Key: FLINK-28706
 URL: https://issues.apache.org/jira/browse/FLINK-28706
 Project: Flink
  Issue Type: New Feature
  Components: Connectors / Hive, Runtime / Coordination, Table SQL / 
Planner, Table SQL / Runtime
Affects Versions: 1.16.0
Reporter: godfrey he


Please refer to 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
 for more details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[VOTE] FLIP-248: Introduce dynamic partition pruning

2022-07-26 Thread godfrey he
Hi everyone,

Thanks for all the feedback so far. Based on the discussion[1] we seem
to have consensus, so I would like to start a vote on FLIP-248 for
which the FLIP has now also been updated[2].

The vote will last for at least 72 hours (Jul 29th 14:00 GMT) unless
there is an objection or insufficient votes.

[1] https://lists.apache.org/thread/v0b8pfh0o7rwtlok2mfs5s6q9w5vw8h6
[2] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning

Best,
Godfrey


Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning

2022-07-26 Thread godfrey he
Thanks for your confirmation.
I will start the vote.

Best,
Godfrey

Jing Zhang  于2022年7月26日周二 19:24写道:
>
> Hi, Godfrey
> Thanks for updating the FLIP.
> It looks good to me now.
>
> Best,
> Jing Zhang
>
> godfrey he  于2022年7月26日周二 12:33写道:
>
> > Thanks for all the inputs, I have updated the document and POC code.
> >
> >
> > Best,
> > Godfrey
> >
> > Yun Gao  于2022年7月26日周二 11:11写道:
> > >
> > > Hi,
> > >
> > > Thanks all for all the valuable discussion on this FLIP, +1 for
> > implementing
> > > dynamic partition pruning / dynamic filtering pushdown since it is a key
> > optimization
> > > to improve the performance on batch processing.
> > >
> > > Also due to introducing the speculative execution for the batch
> > processing, we
> > > might also need some consideration for the case with speculative
> > execution enabled:
> > > 1. The operator coordinator of DynamicFilteringDataCollector should
> > > ignore subsequent filtering data, considering that the task might
> > > execute multiple attempts.
> > > 2. The DynamicFileSplitEnumerator should also implement the
> > > `SupportsHandleExecutionAttemptSourceEvent` interface, otherwise it
> > > would throw an exception when receiving the filtering data source event.
> > >
> > > Best,
> > > Yun Gao
> > >
> > >
> > >
> > > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job
> > >
> > >
> > >
> > > --
> > > From:Jing Ge 
> > > Send Time:2022 Jul. 21 (Thu.) 18:56
> > > To:dev 
> > > Subject:Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning
> > >
> > > Hi,
> > >
> > > Thanks for the informative discussion! Looking forward to using dynamic
> > > filtering provided by Flink.
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Tue, Jul 19, 2022 at 3:22 AM godfrey he  wrote:
> > >
> > > > Hi, Jingong, Jark, Jing,
> > > >
> > > > Thanks for for the important inputs.
> > > > Lake storage is a very important scenario, and consider more generic
> > > > and extended case,
> > > > I also would like to use "dynamic filtering" concept instead of
> > > > "dynamic partition".
> > > >
> > > > >maybe the FLIP should also demonstrate the EXPLAIN result, which
> > > > is also an API.
> > > > I will add a section to describe the EXPLAIN result.
> > > >
> > > > >Does DPP also support streaming queries?
> > > > Yes, but for bounded source.
> > > >
> > > > >it requires the SplitEnumerator must implements new introduced
> > > > `SupportsHandleExecutionAttemptSourceEvent` interface,
> > > > +1
> > > >
> > > > I will update the document and the poc code.
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > > > Jing Zhang  于2022年7月13日周三 20:22写道:
> > > > >
> > > > > Hi Godfrey,
> > > > > Thanks for driving this discussion.
> > > > > This is an important improvement for batch sql jobs.
> > > > > I agree with Jingsong to expand the capability to more than just
> > > > partitions.
> > > > > Besides, I have two points:
> > > > > 1. Based on FLIP-248[1],
> > > > >
> > > > > > Dynamic partition pruning mechanism can improve performance by
> > avoiding
> > > > > > reading large amounts of irrelevant data, and it works for both
> > batch
> > > > and
> > > > > > streaming queries.
> > > > >
> > > > > Does DPP also support streaming queries?
> > > > > It seems the proposed changes in the FLIP-248 does not work for
> > streaming
> > > > > queries,
> > > > > because the dimension table might be an unbounded inputs.
> > > > > Or does it require all dimension tables to be bounded inputs for
> > > > streaming
> > > > > jobs if the job wanna enable DPP?
> > > > >
> > > > > 2. I notice there are changes on SplitEnumerator for Hive source and
> > File
> > > > > source.
> > > > > And they now depend on SourceEvent to pass PartitionData.

Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning

2022-07-25 Thread godfrey he
Thanks for all the inputs, I have updated the document and POC code.


Best,
Godfrey

Yun Gao  于2022年7月26日周二 11:11写道:
>
> Hi,
>
> Thanks all for all the valuable discussion on this FLIP, +1 for implementing
> dynamic partition pruning / dynamic filtering pushdown since it is a key 
> optimization
> to improve the performance on batch processing.
>
> Also due to introducing the speculative execution for the batch processing, we
> might also need some consideration for the case with speculative execution 
> enabled:
> 1. The operator coordinator of DynamicFilteringDataCollector should ignore
> subsequent filtering data, considering that the task might execute multiple
> attempts.
> 2. The DynamicFileSplitEnumerator should also implement the
> `SupportsHandleExecutionAttemptSourceEvent` interface, otherwise it would
> throw an exception when receiving the filtering data source event.
>
> Best,
> Yun Gao
>
>
>
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job
>
>
>
> --
> From:Jing Ge 
> Send Time:2022 Jul. 21 (Thu.) 18:56
> To:dev 
> Subject:Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning
>
> Hi,
>
> Thanks for the informative discussion! Looking forward to using dynamic
> filtering provided by Flink.
>
> Best regards,
> Jing
>
> On Tue, Jul 19, 2022 at 3:22 AM godfrey he  wrote:
>
> > Hi, Jingong, Jark, Jing,
> >
> > Thanks for for the important inputs.
> > Lake storage is a very important scenario, and consider more generic
> > and extended case,
> > I also would like to use "dynamic filtering" concept instead of
> > "dynamic partition".
> >
> > >maybe the FLIP should also demonstrate the EXPLAIN result, which
> > is also an API.
> > I will add a section to describe the EXPLAIN result.
> >
> > >Does DPP also support streaming queries?
> > Yes, but for bounded source.
> >
> > >it requires the SplitEnumerator must implements new introduced
> > `SupportsHandleExecutionAttemptSourceEvent` interface,
> > +1
> >
> > I will update the document and the poc code.
> >
> > Best,
> > Godfrey
> >
> > Jing Zhang  于2022年7月13日周三 20:22写道:
> > >
> > > Hi Godfrey,
> > > Thanks for driving this discussion.
> > > This is an important improvement for batch sql jobs.
> > > I agree with Jingsong to expand the capability to more than just
> > partitions.
> > > Besides, I have two points:
> > > 1. Based on FLIP-248[1],
> > >
> > > > Dynamic partition pruning mechanism can improve performance by avoiding
> > > > reading large amounts of irrelevant data, and it works for both batch
> > and
> > > > streaming queries.
> > >
> > > Does DPP also support streaming queries?
> > > It seems the proposed changes in the FLIP-248 does not work for streaming
> > > queries,
> > > because the dimension table might be an unbounded inputs.
> > > Or does it require all dimension tables to be bounded inputs for
> > streaming
> > > jobs if the job wanna enable DPP?
> > >
> > > 2. I notice there are changes on SplitEnumerator for Hive source and File
> > > source.
> > > And they now depend on SourceEvent to pass PartitionData.
> > > In FLIP-245, if enable speculative execution for sources based on FLIP-27
> > > which use SourceEvent,
> > > it requires the SplitEnumerator must implements new introduced
> > > `SupportsHandleExecutionAttemptSourceEvent` interface,
> > > otherwise an exception would be thrown out.
> > > Since hive and File sources are commonly used for batch jobs, it's better
> > > to take this point into consideration.
> > >
> > > Best,
> > > Jing Zhang
> > >
> > > [1] FLIP-248:
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
> > > [2] FLIP-245:
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
> > >
> > >
> > > Jark Wu  于2022年7月12日周二 13:16写道:
> > >
> > > > I agree with Jingsong. DPP is a particular case of Dynamic Filter
> > Pushdown
> > > > that the join key contains partition fields.  Extending this FLIP to
> > > > general filter
> > > > pushdown can benefit more optimizations, and they can share the

Re: [VOTE] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-25 Thread godfrey he
+1

Best,
Godfrey

Jark Wu  于2022年7月25日周一 17:23写道:
>
> +1 (binding)
>
> Best,
> Jark
>
> On Mon, 25 Jul 2022 at 15:10, Jing Ge  wrote:
>
> > Hi all,
> >
> > Many thanks for all your feedback. Based on the discussion in [1], I'd like
> > to start a vote on FLIP-247 [2].
> >
> > The vote will last for at least 72 hours unless there is an objection or
> > insufficient votes.
> >
> > Best regards,
> > Jing
> >
> > [1] https://lists.apache.org/thread/sgd36d8s8crc822xt57jxvb6m1k6t07o
> > [2]
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> >


Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning

2022-07-18 Thread godfrey he
Hi Jingsong, Jark, Jing,

Thanks for the important inputs.
Lake storage is a very important scenario, and considering the more
generic and extensible case, I would also like to use the "dynamic
filtering" concept instead of "dynamic partition".

>maybe the FLIP should also demonstrate the EXPLAIN result, which
is also an API.
I will add a section to describe the EXPLAIN result.

>Does DPP also support streaming queries?
Yes, but only for bounded sources.

>it requires the SplitEnumerator must implements new introduced
`SupportsHandleExecutionAttemptSourceEvent` interface,
+1

I will update the document and the POC code.

Best,
Godfrey

Jing Zhang  于2022年7月13日周三 20:22写道:
>
> Hi Godfrey,
> Thanks for driving this discussion.
> This is an important improvement for batch sql jobs.
> I agree with Jingsong to expand the capability to more than just partitions.
> Besides, I have two points:
> 1. Based on FLIP-248[1],
>
> > Dynamic partition pruning mechanism can improve performance by avoiding
> > reading large amounts of irrelevant data, and it works for both batch and
> > streaming queries.
>
> Does DPP also support streaming queries?
> It seems the proposed changes in the FLIP-248 does not work for streaming
> queries,
> because the dimension table might be an unbounded inputs.
> Or does it require all dimension tables to be bounded inputs for streaming
> jobs if the job wanna enable DPP?
>
> 2. I notice there are changes on SplitEnumerator for Hive source and File
> source.
> And they now depend on SourceEvent to pass PartitionData.
> In FLIP-245, if enable speculative execution for sources based on FLIP-27
> which use SourceEvent,
> it requires the SplitEnumerator must implements new introduced
> `SupportsHandleExecutionAttemptSourceEvent` interface,
> otherwise an exception would be thrown out.
> Since hive and File sources are commonly used for batch jobs, it's better
> to take this point into consideration.
>
> Best,
> Jing Zhang
>
> [1] FLIP-248:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
> [2] FLIP-245:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
>
>
> Jark Wu  于2022年7月12日周二 13:16写道:
>
> > I agree with Jingsong. DPP is a particular case of Dynamic Filter Pushdown
> > that the join key contains partition fields.  Extending this FLIP to
> > general filter
> > pushdown can benefit more optimizations, and they can share the same
> > interface.
> >
> > For example, Trino Hive Connector leverages dynamic filtering to support:
> > - dynamic partition pruning for partitioned tables
> > - and dynamic bucket pruning for bucket tables
> > - and dynamic filter pushed into the ORC and Parquet readers to perform
> > stripe
> >   or row-group pruning and save on disk I/O.
> >
> > Therefore, +1 to extend this FLIP to Dynamic Filter Pushdown (or Dynamic
> > Filtering),
> > just like Trino [1].  The interfaces should also be adapted for that.
> >
> > Besides, maybe the FLIP should also demonstrate the EXPLAIN result, which
> > is also an API.
> >
> > Best,
> > Jark
> >
> > [1]: https://trino.io/docs/current/admin/dynamic-filtering.html
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, 12 Jul 2022 at 09:59, Jingsong Li  wrote:
> >
> > > Thanks Godfrey for driving.
> > >
> > > I like this FLIP.
> > >
> > > We need not restrict this capability to just partitions.
> > > Here are some inputs from Lake Storage.
> > >
> > > The format of the splits generated by Lake Storage is roughly as follows:
> > > Split {
> > >Path filePath;
> > >Statistics[] fieldStats;
> > > }
> > >
> > > Stats contain the min and max of each column.
> > >
> > > If the storage is sorted by a column, this means that the split
> > > filtering on that column will be very good, so not only the partition
> > > field, but also this column is worthy of being pushed down the
> > > RuntimeFilter.
> > > This information can only be known by source, so I suggest that source
> > > return which fields are worthy of being pushed down.
> > >
> > > My overall point is:
> > > This FLIP can be extended to support Source Runtime Filter push-down
> > > for all fields, not just dynamic partition pruning.
> > >
> > > What do you think?
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Fri,

Re: [DISCUSS] FLIP-247 Bulk fetch of table and column statistics for given partitions

2022-07-15 Thread godfrey he
Hi Jing,

Thanks for driving this, LGTM.
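
For reference, a rough sketch of the bulk-fetch shape under discussion
(method names follow the FLIP draft and may differ in the final API):

import java.util.List;

import org.apache.flink.table.catalog.CatalogPartitionSpec;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.stats.CatalogColumnStatistics;
import org.apache.flink.table.catalog.stats.CatalogTableStatistics;

// One round trip fetches statistics for all given partitions, instead of
// one catalog call per partition as today.
public interface BulkPartitionStatisticsSketch {
    List<CatalogTableStatistics> bulkGetPartitionStatistics(
        ObjectPath tablePath, List<CatalogPartitionSpec> partitionSpecs)
        throws Exception;

    List<CatalogColumnStatistics> bulkGetPartitionColumnStatistics(
        ObjectPath tablePath, List<CatalogPartitionSpec> partitionSpecs)
        throws Exception;
}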

Best,
Godfrey

Jingsong Li  于2022年7月15日周五 11:38写道:
>
> Thanks for starting this discussion.
>
> Have we considered introducing a listPartitionWithStats() in Catalog?
>
> Best,
> Jingsong
>
> On Fri, Jul 15, 2022 at 10:08 AM Jark Wu  wrote:
> >
> > Hi Jing,
> >
> > Thanks for starting this discussion. The bulk fetch is a great improvement
> > for the optimizer.
> > The FLIP looks good to me.
> >
> > Best,
> > Jark
> >
> > On Fri, 8 Jul 2022 at 17:36, Jing Ge  wrote:
> >
> > > Hi devs,
> > >
> > > After having multiple discussions with Jark and Godfrey, I'd like to 
> > > start
> > > a discussion on the mailing list w.r.t. FLIP-247[1], which will
> > > significantly improve the performance by providing the bulk fetch
> > > capability for table and column statistics.
> > >
> > > Currently the statistics information about tables can only be fetched from
> > > the catalog by each given partition iteratively. Since getting statistics
> > > information from catalogs is a very heavy operation, in order to improve
> > > the query performance, we’d better provide functionality to fetch the
> > > statistics information of a table for all given partitions in one shot.
> > >
> > > Based on a manual performance test with 2000 partitions, the cost
> > > improves from 10s to 2s, i.e. a 5x speedup.
> > >
> > > [1]
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-247%3A+Bulk+fetch+of+table+and+column+statistics+for+given+partitions
> > >
> > > Best regards,
> > > Jing
> > >


[jira] [Created] (FLINK-28493) Add document to describe "ANALYZE TABLE" syntax

2022-07-11 Thread godfrey he (Jira)
godfrey he created FLINK-28493:
--

 Summary: Add document to describe "ANALYZE TABLE" syntax
 Key: FLINK-28493
 URL: https://issues.apache.org/jira/browse/FLINK-28493
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28492) Support "ANALYZE TABLE" execution

2022-07-11 Thread godfrey he (Jira)
godfrey he created FLINK-28492:
--

 Summary: Support "ANALYZE TABLE" execution
 Key: FLINK-28492
 URL: https://issues.apache.org/jira/browse/FLINK-28492
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28491) Introduce APPROX_COUNT_DISTINCT aggregate function for batch sql

2022-07-11 Thread godfrey he (Jira)
godfrey he created FLINK-28491:
--

 Summary: Introduce APPROX_COUNT_DISTINCT aggregate function for 
batch sql
 Key: FLINK-28491
 URL: https://issues.apache.org/jira/browse/FLINK-28491
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0
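
A hedged sketch of the intended usage once this sub-task lands (table
and column are illustrative):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ApproxCountDistinctSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inBatchMode().build());
        tEnv.executeSql("CREATE TABLE orders (user_id BIGINT) WITH ("
            + " 'connector' = 'datagen', 'number-of-rows' = '1000')");
        // Approximate distinct count (typically HyperLogLog-based): much
        // cheaper than COUNT(DISTINCT user_id), at the cost of a small,
        // bounded relative error.
        tEnv.executeSql("SELECT APPROX_COUNT_DISTINCT(user_id) FROM orders")
            .print();
    }
}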






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28490) Introduce "ANALYZE TABLE" Syntax in sql parser

2022-07-11 Thread godfrey he (Jira)
godfrey he created FLINK-28490:
--

 Summary: Introduce "ANALYZE TABLE" Syntax in sql parser
 Key: FLINK-28490
 URL: https://issues.apache.org/jira/browse/FLINK-28490
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28489) FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-07-11 Thread godfrey he (Jira)
godfrey he created FLINK-28489:
--

 Summary: FLIP-240: Introduce "ANALYZE TABLE" Syntax
 Key: FLINK-28489
 URL: https://issues.apache.org/jira/browse/FLINK-28489
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0


see [FLIP 
doc|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481] 
for more details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Creating benchmark channel in Apache Flink slack

2022-07-11 Thread godfrey he
+1, Thanks for driving this!

Best,
Godfrey

Yuan Mei  于2022年7月11日周一 16:13写道:
>
> +1 (binding) & thanks for the efforts!
>
> Best
> Yuan
>
>
>
> On Mon, Jul 11, 2022 at 2:08 PM Yun Gao 
> wrote:
>
> > +1 (binding)
> >
> > Thanks Anton for driving this!
> >
> >
> > Best,
> > Yun Gao
> >
> >
> > --
> > From:Anton Kalashnikov 
> > Send Time:2022 Jul. 8 (Fri.) 22:59
> > To:undefined 
> > Subject:[VOTE] Creating benchmark channel in Apache Flink slack
> >
> > Hi everyone,
> >
> >
> > I would like to start a vote for creating the new channel in Apache
> > Flink slack for sending benchmark results to it. This should help the
> > community notice performance degradations in time.
> >
> > The discussion of this idea can be found here[1]. The ticket for this is
> > here[2].
> >
> >
> > [1] https://www.mail-archive.com/dev@flink.apache.org/msg58666.html
> >
> > [2] https://issues.apache.org/jira/browse/FLINK-28468
> >
> > --
> >
> > Best regards,
> > Anton Kalashnikov


[DISCUSS] FLIP-248: Introduce dynamic partition pruning

2022-07-08 Thread godfrey he
Hi all,

I would like to open a discussion on FLIP-248: Introduce dynamic
partition pruning.

Currently, Flink supports static partition pruning: the conditions in
the WHERE clause are analyzed to determine in advance which partitions
can be safely skipped in the optimization phase.
Another common scenario: the partition information is not available in
the optimization phase but only in the execution phase.
That's the problem this FLIP is trying to solve: dynamic partition
pruning, which can reduce the IO of partitioned table sources.

The query pattern looks like:
select * from store_returns, date_dim where sr_returned_date_sk =
d_date_sk and d_year = 2000

We will introduce a mechanism for detecting dynamic partition pruning
patterns in the optimization phase and performing partition pruning at
runtime, by sending the dimension table results to the SplitEnumerator
of the fact table via the existing coordinator mechanism.
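
To make the pattern concrete, a rough sketch of the planner-facing
ability interface the FLIP proposes (the name follows the FLIP draft and
may change; elsewhere in this thread it is generalized towards "dynamic
filtering"):

import java.util.List;

// The planner detects the DPP pattern and tells the fact-table scan which
// partition fields take part in the join. At runtime the dimension side
// sends the matching key values to the fact table's SplitEnumerator,
// which then only enumerates splits of the surviving partitions.
public interface SupportsDynamicPartitionPruningSketch {
    /** Partition fields of the fact table that appear as join keys. */
    void applyDynamicPartitionPruning(List<String> candidatePartitionFields);
}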

You can find more details in the FLIP-248 document[1].
Looking forward to your feedback.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
[2] POC: https://github.com/godfreyhe/flink/tree/FLIP-248


Best,
Godfrey


Re: Re: [VOTE] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)

2022-07-06 Thread godfrey he
+1, thanks for driving this

Best,
Godfrey

Mang Zhang  于2022年7月5日周二 11:56写道:
>
> Hi everyone,
> I'm sorry to bother you all, but since FLIP-218[1] has been updated, I'm 
> going to relaunch VOTE.
> The main contents of the modification are:
> 1. remove rtas from option name
> 2. no longer introduce AtomicCatalog, add javadocs description for Catalog 
> interface:
> If Catalog needs to support the atomicity feature of CTAS, then Catalog must 
> implement Serializable and make the Catalog instances can be 
> serializable/deserializable using Java serialization.
> When atomicity support for CTAS is enabled, Planner will check if the Catalog 
> instance can be serialized using the Java serialization.
>
>
>
>
>
>
>
> [1]  
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185
>
>
>
>
> --
>
> Best regards,
> Mang Zhang
>
>
>
>
>
> At 2022-07-05 09:19:24, "Jiangang Liu"  wrote:
> >+1 for the feature.
> >
> >Jark Wu  于2022年7月4日周一 17:33写道:
> >
> >> Hi Mang,
> >>
> >> I left a comment in the DISCUSS thread.
> >>
> >> Best,
> >> Jark
> >>
> >> On Mon, 4 Jul 2022 at 15:24, Rui Fan <1996fan...@gmail.com> wrote:
> >>
> >> > Hi.
> >> >
> >> > Thanks Mang for this FLIP. I think it will be useful for users.
> >> >
> >> > +1(non-binding)
> >> >
> >> > Best wishes
> >> > Rui Fan
> >> >
> >> > On Mon, Jul 4, 2022 at 3:01 PM Mang Zhang  wrote:
> >> >
> >> > > Hi everyone,
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > Thanks for all the feedback so far. Based on the discussion [1], we
> >> seem
> >> > > to have consensus. So, I would like to start a vote on FLIP-218 [2].
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > The vote will last for at least 72 hours unless there is an objection
> >> or
> >> > > insufficient votes.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > [1] https://lists.apache.org/thread/mc0lv4gptm7som02hpob1hdp3hb1ps1v
> >> > >
> >> > > [2]
> >> > >
> >> >
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > Best regards,
> >> > > Mang Zhang
> >> >
> >>


Re: [DISCUSS] Support partition pruning for streaming reading

2022-07-03 Thread godfrey he
Hi zoucao,

Looking forward to your FLIP.

>For Batch reading, the 'remainingPartitions' will be seen as the partitions
>needed to consume, for streaming reading, we use the
>'partitionPruningFunction' to ignore the unneeded partitions.
To be precise, for a bounded source (whether batch or streaming),
`applyPartitions` should be used, while only for an unbounded source
should `applyPartitionPruningFunction` be used.
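
A rough sketch of that split (method and type names are illustrative,
not a final API); the pruning function must be serializable because it
is shipped to the source's partition fetcher:

import java.io.Serializable;
import java.util.List;
import java.util.Map;

public interface PartitionPushDownSketch {
    /** Bounded reading: the fixed set of partitions to consume. */
    void applyPartitions(List<Map<String, String>> remainingPartitions);

    /** Unbounded reading: a predicate applied to each newly discovered partition. */
    void applyPartitionPruningFunction(PartitionPruningFunction pruningFunction);

    @FunctionalInterface
    interface PartitionPruningFunction extends Serializable {
        boolean shouldConsume(Map<String, String> partitionSpec);
    }
}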

Best,
Godfrey

cao zou  于2022年7月4日周一 11:04写道:
>
> Hi Martijn, thanks for your attention, I'm glad to create a FLIP, and could
> you help give me the permission?
> My Id is zoucao, and my mail is zoucao...@gmail.com.
>
> The implications for FileSource
>
> In the above discussion, only HiveSource has been involved, because it
> holds a continuous partition fetcher, but FileSource not. If we do the
> streaming pruning only in the partition fetcher, it will not affect the
> FileSource. If the FileSource supports streaming reading in the future, the
> same changes can be applied to it.
>
> Best regards,
> zoucao
>
> Martijn Visser  于2022年7月1日周五 16:20写道:
>
> > Hi zoucao,
> >
> > I think this topic deserves a proper FLIP and a vote. This approach is
> > focussed only on Hive, but I would also like to understand the implications
> > for FileSource. Can you create one?
> >
> > Best regards,
> >
> > Martijn
> >
> > Op wo 22 jun. 2022 om 18:50 schreef cao zou :
> >
> > > Hi devs, I want to start a discussion to find a way to support partition
> > > pruning for streaming reading.
> > >
> > >
> > > Now, Flink has supported the partition pruning, the implementation
> > consists
> > > of *Source Ability*, *Logical Rule*, and the interface
> > > *SupportsPartitionPushDown*, but they all only take effect in batch
> > > reading. When reading a table in streaming mode, the existing mechanism
> > > will cause some problems posted by FLINK-27898
> > > [1], and the records
> > > that should be filtered will be sent downstream.
> > >
> > > To solve this drawback, this discussion is proposed, and the Hive and
> > other
> > > BigData systems stored with partitions will benefit more from it.
> > >
> > >  Now, the existing partitions which are needed to consume will be
> > generated
> > > in *PushPartitionIntoTableSourceScanRule*. Then, the partitions will be
> > > pushed into TableSource. It’s working well in batch mode, but if we want
> > to
> > > read records from Hive in streaming mode, and consider the partitions
> > > committed in the future, it’s not enough.
> > >
> > > To support pruning the partitions committed in the future, the pruning
> > > function should be pushed into the TableSource, and then delivered to
> > > *ContinuousPartitionFetcher*, such that the pruning for uncommitted
> > > partitions can be invoked here.
> > >
> > > Before proposing the changes, I think it is necessary to clarify the
> > > existing pruning logic. The main logic of the pruning in
> > > *PushPartitionIntoTableSourceScanRule* is as follows.
> > >
> > > Firstly, generating a pruning function called partitionPruner, the
> > function
> > > is extended from a RichMapFunction.
> > >
> > >
> > > if tableSource.listPartitions() is not empty:
> > >   partitions = tableSource.listPartitions()
> > >   for p in partitions:
> > >     boolean predicate = partitionPruner.map(convertPartitionToRow(p))
> > >     add p to partitionsAfterPruning where the predicate is true
> > > else:  # tableSource.listPartitions() is empty
> > >   if the filter can be converted to ResolvedExpression &&
> > >      the catalog can support the filter:
> > >     partitionsAfterPruning = catalog.listPartitionsByFilter()
> > >     # all partitions in partitionsAfterPruning are needed
> > >   else:
> > >     partitions = catalog.listPartitions()
> > >     for p in partitions:
> > >       boolean predicate = partitionPruner.map(convertPartitionToRow(p))
> > >       add p to partitionsAfterPruning where the predicate is true
> > >
> > > I think the main logic can be classified into two sides, one exists in
> > the
> > > logical rule, and the other exists in the connector side. The catalog
> > info
> > > should be used on the rule side, and not on the connector side, the
> > pruning
> > > function could be used on both of them or unified on the connector side.
> > >
> > >
> > > Proposed changes
> > >
> > >
> > >- add a new method in SupportsPartitionPushDown
> > >- let HiveSourceTable, HiveSourceBuilder, and
> > >HiveContinuousPartitionFetcher hold the pruning function.
> > >- pruning after fetchPartitions invoked.
> > >
> > > Considering the version compatibility and the optimization for the method
> > > of listing partitions with filter in the catalog, I think we can add a
> > new
> > > method in *SupportsPartitionPushDown*
> > >
> > > /**
> > > * Provides a list of remaining partitions. After those partitions are
> > > applied, a source must
> > > * not read the data of other partitions during runtime.

[RESULT][VOTE] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-29 Thread godfrey he
Hi, everyone.

FLIP-240: Introduce "ANALYZE TABLE" Syntax[1] has been accepted.

There are 3 binding votes and 2 non-binding votes[2].
- Jingsong Li (binding)
- Shengkai Fang (binding)
- Roc Marshal (non-binding)
- Leonard Xu (binding)
- yuxia (non-binding)

None against.

Thanks again to everyone who commented on this FLIP.


[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
[2] https://lists.apache.org/thread/lltcom64dzzf324fpl19nfzxb56k37wd


Best,
Godfrey


[VOTE] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-24 Thread godfrey he
Hi everyone,

Thanks for all the feedback so far. Based on the discussion[1] we seem
to have consensus, so I would like to start a vote on FLIP-240 for
which the FLIP has now also been updated[2].

The vote will last for at least 72 hours (Jun 29th 14:00 GMT,
excluding weekend days) unless
there is an objection or insufficient votes.

[1] https://lists.apache.org/thread/rsrykky79rzp1nyrkff0tl3xc7hsv31q
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481

Best,
Godfrey


Re: Re: [ANNOUNCE] New Apache Flink Committers: Qingsheng Ren, Shengkai Fang

2022-06-24 Thread godfrey he
Congrats, Shengkai and Qingsheng!

Best,
Godfrey

Yu Li  于2022年6月22日周三 23:55写道:
>
> Congrats and welcome, Qingsheng and Shengkai!
>
> Best Regards,
> Yu
>
>
> On Wed, 22 Jun 2022 at 17:43, Jiangang Liu 
> wrote:
>
> > Congratulations!
> >
> > Best,
> > Jiangang Liu
> >
> > Mason Chen  于2022年6月22日周三 00:37写道:
> >
> > > Awesome work Qingsheng and Shengkai!
> > >
> > > Best,
> > > Mason
> > >
> > > On Tue, Jun 21, 2022 at 4:53 AM Zhipeng Zhang 
> > > wrote:
> > >
> > > > Congratulations, Qingsheng and ShengKai.
> > > >
> > > > Yang Wang  于2022年6月21日周二 19:43写道:
> > > >
> > > > > Congratulations, Qingsheng and ShengKai.
> > > > >
> > > > >
> > > > > Best,
> > > > > Yang
> > > > >
> > > > > Benchao Li  于2022年6月21日周二 19:33写道:
> > > > >
> > > > > > Congratulations!
> > > > > >
> > > > > > weijie guo  于2022年6月21日周二 13:44写道:
> > > > > >
> > > > > > > Congratulations, Qingsheng and ShengKai!
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Weijie
> > > > > > >
> > > > > > >
> > > > > > > Yuan Mei  于2022年6月21日周二 13:07写道:
> > > > > > >
> > > > > > > > Congrats Qingsheng and ShengKai!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Yuan
> > > > > > > >
> > > > > > > > On Tue, Jun 21, 2022 at 11:27 AM Terry Wang <
> > zjuwa...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Congratulations, Qingsheng and ShengKai!
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Best,
> > > > > > Benchao Li
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > best,
> > > > Zhipeng
> > > >
> > >
> >


Re: Re: Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)

2022-06-24 Thread godfrey he
Hi all,

Sorry for the late reply.

>table.cor-table-as-select.atomicity-enabled
Regarding `cor`,  this abbreviation is not commonly used.

>Create Table As Select(CTAS) feature depends on the serializability of the 
>catalog. To quickly see if the catalog supports CTAS, we need to try to 
>serialize the catalog when compile SQL in planner and if it fails, an 
>exception will be >thrown to indicate to user that the catalog does not 
>support CTAS because it cannot be serialized.
This behavior is too cryptic, and will break the current catalog
behavior when upgrading to 1.16.
I suggest we introduce a new interface for atomic catalogs which
extends Serializable.
The existing catalogs can then choose whether to implement the new
catalog interface.
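
A minimal sketch of that suggestion (the interface name is hypothetical):

import java.io.Serializable;

import org.apache.flink.table.catalog.Catalog;

// Catalogs opt in to atomic CTAS by implementing this marker interface;
// the planner then serializes the catalog instance into the job graph so
// the created table can be dropped if the job fails. Catalogs that do
// not implement it keep their current behavior.
public interface AtomicCatalogSketch extends Catalog, Serializable {
}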

> Catalog#inferTableOptions
I strongly recommend not introducing this feature now, because the
behavior is unclear.
1) If the catalog supports managed tables, the connector options are
empty; so if the user forgets to set the connector option in a CTAS
statement, the created table will silently become a managed table.
2) The options and their values for the catalog and for the connector
may be different, so using the catalog options may cause unexpected
errors.

> StreamGraph#addJobStatusHook
I prefer `registerJobStatusHook`

Best,
Godfrey

Mang Zhang  于2022年6月13日周一 16:43写道:
>
> Hi Yun,
> Thanks for your reply!
> Through offline communication with Dalong, I updated the JobStatusHook part 
> to FLIP, looking forward to your feedback.
>
>
>
> --
>
> Best regards,
> Mang Zhang
>
>
>
>
>
> At 2022-05-31 14:34:25, "Yun Gao"  wrote:
> >Hi,
> >
> >Regarding the drop operation, with some offline discussion with Dalong and 
> >Zhu,
> >we think that listening in the client side might be problematic since it 
> >would exit
> >after submitting the jobs in detached mode, thus the operation might need to
> >be in the JobMaster side.
> >
> >For the listener interface, currently JobListener only resides in the client 
> >side
> >and contains unsuitable methods like onJobSubmitted for this scenario, and
> >the internal JobStatusListener is designed to be used inside JM and is not
> >serializable, thus we tend to add a new interface JobStatusHook,
> >which could be attached to the JobGraph and executed in the JobMaster.
> >The interface will also be marked as Internal.
> >
> >Best,
> >Yun
> >
> >
> >--
> >From:Mang Zhang 
> >Send Time:2022 May 25 (Wed.) 10:24
> >To:dev 
> >Subject:Re:Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE 
> >TABLE(CTAS)
> >
> >Hi, Martijn
> >Thanks for your reply!
> >I looked at the SQL standard, CTAS is part of the SQL standard.
> >Feature T172 is "AS subquery clause in table definition".
> >
> >
> >
> >--
> >
> >Best regards,
> >Mang Zhang
> >
> >
> >
> >
> >
> >At 2022-05-04 21:49:00, "Martijn Visser"  wrote:
> >>Hi everyone,
> >>
> >>Can we identify if this proposed syntax is part of the SQL standard?
> >>
> >>Best regards,
> >>
> >>Martijn Visser
> >>https://twitter.com/MartijnVisser82
> >>https://github.com/MartijnVisser
> >>
> >>
> >>On Fri, 29 Apr 2022 at 11:19, yuxia  wrote:
> >>
> >>> Thanks for driving this work; it's going to be a useful feature.
> >>> About the flip-218, I have some questions.
> >>>
> >>> 1: Does our CTAS syntax support specify target table's schema including
> >>> column name and data type? I think it maybe a useful fature in case we 
> >>> want
> >>> to change the data types in target table instead of always copy the source
> >>> table's schema. It'll be more flexible with this feature.
> >>> Btw, MySQL's "CREATE TABLE ... SELECT Statement"[1] support this feature.
> >>>
> >>> 2: Seems it'll requre sink to implement an public interface to drop table,
> >>> so what's the interface will look like?
> >>>
> >>> [1] https://dev.mysql.com/doc/refman/8.0/en/create-table-select.html
> >>>
> >>> Best regards,
> >>> Yuxia
> >>>
> >>> ----- Original Message -----
> >>> From: "Mang Zhang" 
> >>> To: "dev" 
> >>> Sent: Thursday, April 28, 2022 4:57:24 PM
> >>> Subject: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)
> >>>
> >>> Hi, everyone
> >>>
> >>>
> >>> I would like to open a discussion for support select clause in CREATE
> >>> TABLE(CTAS),
> >>> With the development of business and the enhancement of flink sql
> >>> capabilities, queries become more and more complex.
> >>> Now the user needs to use the Create Table statement to create the target
> >>> table first, and then execute the insert statement.
> >>> However, the target table may have many columns, which will bring a lot of
> >>> work outside the business logic to the user.
> >>> At the same time, ensure that the schema of the created target table is
> >>> consistent with the schema of the query result.
> >>> Using a CTAS syntax like Hive/Spark can greatly facilitate the user.
> >>>
> >>>
> >>>
> >>> You can find more details in FLIP-218[1]. Looking forward to your 
> >>> feedback.
> >>>
> >>>
> >>>
> >>> [1]
> >>> https://cwiki.apache.org/confluence/display/FLINK/

Re: Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-23 Thread godfrey he
Hi, everyone.

Thanks for all the inputs.
If there is no further feedback, I will start the vote tomorrow.

Best,
Godfrey

Jing Ge  于2022年6月22日周三 15:50写道:
>
> sounds good to me. Thanks!
>
> Best regards,
> Jing
>
> On Fri, Jun 17, 2022 at 5:37 AM godfrey he  wrote:
>
> > Hi, Jing.
> >
> > Thanks for the feedback.
> >
> > >When will the converted SELECT statement of the ANALYZE TABLE be
> > > submitted? right after the CREATE TABLE?
> > The SELECT  job will be submitted only when `ANALYZE TABLE` is executed,
> > and there is nothing to do with CREATE TABLE. Because the `ANALYZE TABLE`
> > is triggered manually as needed.
> >
> > >Will it be submitted periodically to keep the statistical data
> > >up-to-date, since the data might be mutable?
> > the `ANALYZE TABLE` is triggered manually as needed.
> > I will update the doc.
> >
> > >It might not be strong enough to avoid human error
> > > I would suggest using FOR ALL PARTITIONS explicitly
> > > just like FOR ALL COLUMNS.
> > Agree, specifying `PARTITION` explicitly is more friendly
> > and safe. I prefer to use `PARTITION(ds, hr)` without
> > specific partition value, hive has the similar syntax.
> > WDYT ?
> >
> > Best,
> > Godfrey
> >
> > Jing Ge  于2022年6月16日周四 03:53写道:
> > >
> > > Hi Godfrey,
> > >
> > > Thanks for driving this! There are some areas where I couldn't find
> > enough
> > > information in the FLIP, just wondering if I could get more
> > > explanation from you w.r.t. the following questions:
> > >
> > > 1. When will the converted SELECT statement of the ANALYZE TABLE be
> > > submitted? right after the CREATE TABLE?
> > >
> > > 2. Will it be submitted periodically to keep the statistical data
> > > up-to-date, since the data might be mutable?
> > >
> > > 3. " If no partition is specified, the statistics will be gathered for
> > all
> > > partitions"  - I think this is fine for multi-level partitions, e.g.
> > PARTITION
> > > (ds='2022-06-01') means two partitions: PARTITION (ds='2022-06-01', hr=1)
> > > and PARTITION (ds='2022-06-01', hr=2), because it will save a lot of code
> > > and therefore help developer work more efficiently. If we use this rule
> > for
> > > top level partitions, It might not be strong enough to avoid human
> > > error, e.g. developer might trigger huge selection on the table with many
> > > partitions, when he forgot to write the partition in the ANALYZE TABLE
> > > script. In this case, I would suggest using FOR ALL PARTITIONS explicitly
> > > just like FOR ALL COLUMNS.
> > >
> > > Best regards,
> > > Jing
> > >
> > >
> > > On Wed, Jun 15, 2022 at 10:16 AM godfrey he  wrote:
> > >
> > > > Hi Jark,
> > > >
> > > > Thanks for the inputs.
> > > >
> > > > >Do we need to provide DESC EXTENDED  statement like
> > Spark[1]
> > > > to
> > > > >show statistic for table/partition/columns?
> > > > We do have supported `DESC EXTENDED` syntax, but currently only table
> > > > schema
> > > > will be display, I think we just need a JIRA to support it.
> > > >
> > > > > is it possible to ignore execution mode and force using batch mode
> > for
> > > > the statement?
> > > > As I replied above, The semantics of `ANALYZE TABLE` does not
> > > > distinguish batch and streaming,
> > > > It works for both batch and streaming, but the result of unbounded
> > > > sources is meaningless.
> > > > Currently, I throw exception for streaming mode,
> > > > and we can support streaming mode with bounded source in the future.
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > > > Jark Wu  于2022年6月14日周二 17:56写道:
> > > > >
> > > > > Hi Godfrey, thanks for starting this discussion, this is a great
> > feature
> > > > > for batch users.
> > > > >
> > > > > The FLIP looks good to me in general.
> > > > >
> > > > > I only have 2 comments:
> > > > >
> > > > > 1) How do users know whether the given table or partition contains
> > > > required
> > > > > statistics?
> > > > > Do we need to provide DESC EXTENDED  statement like
> >

Re: Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-16 Thread godfrey he
Hi Jark,

I have created the issue and will be done in release 1.16,
see https://issues.apache.org/jira/browse/FLINK-28074

Best,
Godfrey

Jark Wu  于2022年6月16日周四 18:03写道:
>
> Hi Godfrey,
>
> > we just need a JIRA to support it.
> Could you create the JIRA issue? I think it would be better if we can
> support
> `DESC EXTENDED` and `ANALYZE TABLE` together in the 1.16 release.
> Otherwise, it's hard for users to determine when to call ANALYZE TABLE.
>
>
> Best,
> Jark
>
> On Thu, 16 Jun 2022 at 03:53, Jing Ge  wrote:
>
> > Hi Godfrey,
> >
> > Thanks for driving this! There are some areas where I couldn't find enough
> > information in the FLIP, just wondering if I could get more
> > explanation from you w.r.t. the following questions:
> >
> > 1. When will the converted SELECT statement of the ANALYZE TABLE be
> > submitted? right after the CREATE TABLE?
> >
> > 2. Will it be submitted periodically to keep the statistical data
> > up-to-date, since the data might be mutable?
> >
> > 3. " If no partition is specified, the statistics will be gathered for all
> > partitions"  - I think this is fine for multi-level partitions, e.g.
> > PARTITION
> > (ds='2022-06-01') means two partitions: PARTITION (ds='2022-06-01', hr=1)
> > and PARTITION (ds='2022-06-01', hr=2), because it will save a lot of code
> > and therefore help developer work more efficiently. If we use this rule for
> > top level partitions, It might not be strong enough to avoid human
> > error, e.g. developer might trigger huge selection on the table with many
> > partitions, when he forgot to write the partition in the ANALYZE TABLE
> > script. In this case, I would suggest using FOR ALL PARTITIONS explicitly
> > just like FOR ALL COLUMNS.
> >
> > Best regards,
> > Jing
> >
> >
> > On Wed, Jun 15, 2022 at 10:16 AM godfrey he  wrote:
> >
> > > Hi Jark,
> > >
> > > Thanks for the inputs.
> > >
> > > >Do we need to provide DESC EXTENDED  statement like Spark[1]
> > > to
> > > >show statistic for table/partition/columns?
> > > We do have supported `DESC EXTENDED` syntax, but currently only table
> > > schema
> > > will be display, I think we just need a JIRA to support it.
> > >
> > > > is it possible to ignore execution mode and force using batch mode for
> > > the statement?
> > > As I replied above, The semantics of `ANALYZE TABLE` does not
> > > distinguish batch and streaming,
> > > It works for both batch and streaming, but the result of unbounded
> > > sources is meaningless.
> > > Currently, I throw exception for streaming mode,
> > > and we can support streaming mode with bounded source in the future.
> > >
> > > Best,
> > > Godfrey
> > >
> > > Jark Wu  于2022年6月14日周二 17:56写道:
> > > >
> > > > Hi Godfrey, thanks for starting this discussion, this is a great
> > feature
> > > > for batch users.
> > > >
> > > > The FLIP looks good to me in general.
> > > >
> > > > I only have 2 comments:
> > > >
> > > > 1) How do users know whether the given table or partition contains
> > > required
> > > > statistics?
> > > > Do we need to provide DESC EXTENDED  statement like
> > Spark[1]
> > > to
> > > > show statistic for table/partition/columns?
> > > >
> > > > 2) If ANALYZE TABLE can only run in batch mode, is it possible to
> > ignore
> > > > execution mode
> > > > and force using batch mode for the statement? From my perspective,
> > > ANALYZE
> > > > TABLE
> > > > is an auxiliary statement similar to SHOW TABLES but heavier, which
> > > doesn't
> > > > care about
> > > > environment execution mode.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > [1]:
> > > >
> > >
> > https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
> > > >
> > > > On Tue, 14 Jun 2022 at 13:52, Jing Ge  wrote:
> > > >
> > > > > Hi 华宗
> > > > >
> > > > > To unsubscribe, please send any message to
> > > > > dev-unsubscr...@flink.apache.org
> > > > >
> > > > > Thanks
> > > >

Re: Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-16 Thread godfrey he
Hi, Jing.

Thanks for the feedback.

>When will the converted SELECT statement of the ANALYZE TABLE be
> submitted? right after the CREATE TABLE?
The SELECT job will be submitted only when `ANALYZE TABLE` is executed;
it has nothing to do with CREATE TABLE, because `ANALYZE TABLE` is
triggered manually, as needed.

>Will it be submitted periodically to keep the statistical data
>up-to-date, since the data might be mutable?
No, `ANALYZE TABLE` is triggered manually, as needed; it is not
resubmitted periodically. I will update the doc.

>It might not be strong enough to avoid human error
> I would suggest using FOR ALL PARTITIONS explicitly
> just like FOR ALL COLUMNS.
Agreed, specifying `PARTITION` explicitly is more friendly and safe.
I prefer to use `PARTITION(ds, hr)` without specific partition values;
Hive has similar syntax.
WDYT?
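
For illustration, the two variants would look like this (syntax per the
current FLIP draft; MyTable and the partition keys ds/hr are
hypothetical, and tEnv is a batch TableEnvironment with that partitioned
table registered):

// Analyze one concrete partition:
tEnv.executeSql(
    "ANALYZE TABLE MyTable PARTITION(ds='2022-06-01', hr='12')"
        + " COMPUTE STATISTICS FOR ALL COLUMNS");
// Partition keys without values, i.e. "all partitions" stated explicitly:
tEnv.executeSql(
    "ANALYZE TABLE MyTable PARTITION(ds, hr)"
        + " COMPUTE STATISTICS FOR ALL COLUMNS");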

Best,
Godfrey

Jing Ge  于2022年6月16日周四 03:53写道:
>
> Hi Godfrey,
>
> Thanks for driving this! There are some areas where I couldn't find enough
> information in the FLIP, just wondering if I could get more
> explanation from you w.r.t. the following questions:
>
> 1. When will the converted SELECT statement of the ANALYZE TABLE be
> submitted? right after the CREATE TABLE?
>
> 2. Will it be submitted periodically to keep the statistical data
> up-to-date, since the data might be mutable?
>
> 3. " If no partition is specified, the statistics will be gathered for all
> partitions"  - I think this is fine for multi-level partitions, e.g. PARTITION
> (ds='2022-06-01') means two partitions: PARTITION (ds='2022-06-01', hr=1)
> and PARTITION (ds='2022-06-01', hr=2), because it will save a lot of code
> and therefore help developer work more efficiently. If we use this rule for
> top level partitions, It might not be strong enough to avoid human
> error, e.g. developer might trigger huge selection on the table with many
> partitions, when he forgot to write the partition in the ANALYZE TABLE
> script. In this case, I would suggest using FOR ALL PARTITIONS explicitly
> just like FOR ALL COLUMNS.
>
> Best regards,
> Jing
>
>
> On Wed, Jun 15, 2022 at 10:16 AM godfrey he  wrote:
>
> > Hi Jark,
> >
> > Thanks for the inputs.
> >
> > >Do we need to provide DESC EXTENDED  statement like Spark[1]
> > to
> > >show statistic for table/partition/columns?
> > We do have supported `DESC EXTENDED` syntax, but currently only table
> > schema
> > will be display, I think we just need a JIRA to support it.
> >
> > > is it possible to ignore execution mode and force using batch mode for
> > the statement?
> > As I replied above, The semantics of `ANALYZE TABLE` does not
> > distinguish batch and streaming,
> > It works for both batch and streaming, but the result of unbounded
> > sources is meaningless.
> > Currently, I throw exception for streaming mode,
> > and we can support streaming mode with bounded source in the future.
> >
> > Best,
> > Godfrey
> >
> > Jark Wu  于2022年6月14日周二 17:56写道:
> > >
> > > Hi Godfrey, thanks for starting this discussion, this is a great feature
> > > for batch users.
> > >
> > > The FLIP looks good to me in general.
> > >
> > > I only have 2 comments:
> > >
> > > 1) How do users know whether the given table or partition contains
> > required
> > > statistics?
> > > Do we need to provide DESC EXTENDED  statement like Spark[1]
> > to
> > > show statistic for table/partition/columns?
> > >
> > > 2) If ANALYZE TABLE can only run in batch mode, is it possible to ignore
> > > execution mode
> > > and force using batch mode for the statement? From my perspective,
> > ANALYZE
> > > TABLE
> > > is an auxiliary statement similar to SHOW TABLES but heavier, which
> > doesn't
> > > care about
> > > environment execution mode.
> > >
> > > Best,
> > > Jark
> > >
> > > [1]:
> > >
> > https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
> > >
> > > On Tue, 14 Jun 2022 at 13:52, Jing Ge  wrote:
> > >
> > > > Hi 华宗
> > > >
> > > > 退订请发送任意消息至dev-unsubscr...@flink.apache.org
> > > > In order to unsubscribe, please send an email to
> > > > dev-unsubscr...@flink.apache.org
> > > >
> > > > Thanks
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > >
> > > > On Tue, Jun 14, 2022 at 2:05 AM 华宗  wrote:
> > > >

Re: [VOTE] FLIP-222: Support full job lifecycle statements in SQL client

2022-06-16 Thread godfrey he
+1

Best,
Godfrey

Martijn Visser  于2022年6月13日周一 20:25写道:
>
> +1 (binding)
>
> Op ma 13 jun. 2022 om 13:52 schreef Jark Wu :
>
> > +1 (binding)
> >
> > nit: the stop job syntax should be
> > STOP JOB '' [WITH SAVEPOINT] [WITH DRAIN]
> > which means savepoint and drain can be enabled at the same time.
> >
> > Best,
> > Jark
> >
> > On Mon, 13 Jun 2022 at 17:20, Jing Ge  wrote:
> >
> > > +1 (not binding)
> > >
> > > Thanks Paul for your effort!
> > >
> > > Best regards,
> > > Jing
> > >
> > > On Mon, Jun 13, 2022 at 11:11 AM Paul Lam  wrote:
> > >
> > > > Hi team,
> > > >
> > > > I'd like to start a vote for FLIP-222: Support full job lifecycle
> > > > statements in SQL client [1],
> > > > which is discussed on the thread [2].
> > > >
> > > > The vote will be open for at least 72 hours unless there is an
> > objection
> > > > or not enough votes.
> > > >
> > > > Thanks everyone who has participated in the discussion!
> > > >
> > > > [1]
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-222%3A+Support+full+job+lifecycle+statements+in+SQL+client
> > > > [2] https://lists.apache.org/thread/qkvh9p5w9b12s7ykh3l7lv7m9dbgnf1g
> > > >
> > > > Best,
> > > > Paul Lam
> > > >
> > > >
> > >
> >


[jira] [Created] (FLINK-28075) get statistics for partitioned table even without partition pruning

2022-06-15 Thread godfrey he (Jira)
godfrey he created FLINK-28075:
--

 Summary: get statistics for partitioned table even without 
partition pruning
 Key: FLINK-28075
 URL: https://issues.apache.org/jira/browse/FLINK-28075
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0


Currently, the statistics for partitioned table will not be collected if there 
is no partition pruning



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-28074) display

2022-06-15 Thread godfrey he (Jira)
godfrey he created FLINK-28074:
--

 Summary: display
 Key: FLINK-28074
 URL: https://issues.apache.org/jira/browse/FLINK-28074
 Project: Flink
  Issue Type: New Feature
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-15 Thread godfrey he
Hi Jark,

Thanks for the inputs.

>Do we need to provide DESC EXTENDED  statement like Spark[1] to
>show statistic for table/partition/columns?
We do already support the `DESC EXTENDED` syntax, but currently only the
table schema is displayed; I think we just need a JIRA ticket to support it.

> is it possible to ignore execution mode and force using batch mode for the 
> statement?
As I replied above, the semantics of `ANALYZE TABLE` do not distinguish
between batch and streaming: it works for both, but the result over
unbounded sources is meaningless.
Currently, an exception is thrown in streaming mode, and we can support
streaming mode with bounded sources in the future.

Best,
Godfrey

Jark Wu  于2022年6月14日周二 17:56写道:
>
> Hi Godfrey, thanks for starting this discussion, this is a great feature
> for batch users.
>
> The FLIP looks good to me in general.
>
> I only have 2 comments:
>
> 1) How do users know whether the given table or partition contains required
> statistics?
> Do we need to provide DESC EXTENDED  statement like Spark[1] to
> show statistic for table/partition/columns?
>
> 2) If ANALYZE TABLE can only run in batch mode, is it possible to ignore
> execution mode
> and force using batch mode for the statement? From my perspective, ANALYZE
> TABLE
> is an auxiliary statement similar to SHOW TABLES but heavier, which doesn't
> care about
> environment execution mode.
>
> Best,
> Jark
>
> [1]:
> https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html
>
> On Tue, 14 Jun 2022 at 13:52, Jing Ge  wrote:
>
> > Hi 华宗
> >
> > To unsubscribe, please send any message to
> > dev-unsubscr...@flink.apache.org
> >
> > Thanks
> >
> > Best regards,
> > Jing
> >
> >
> > On Tue, Jun 14, 2022 at 2:05 AM 华宗  wrote:
> >
> > > Unsubscribe
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > At 2022-06-13 22:44:24, "cao zou"  wrote:
> > > >Hi godfrey, thanks for your detail explanation.
> > > >After explaining and glancing over the FLIP-231, I think it is
> > > >really need, +1 for this and looking forward to it.
> > > >
> > > >best
> > > >zoucao
> > > >
> > > >godfrey he  于2022年6月13日周一 14:43写道:
> > > >
> > > >> Hi Ingo,
> > > >>
> > > >> The semantics does not distinguish batch and streaming,
> > > >> It works for both batch and streaming, but the result of
> > > >> unbounded sources is meaningless.
> > > >> Currently, I throw exception for streaming mode,
> > > >> and we can support streaming mode with bounded source
> > > >> in the future.
> > > >>
> > > >> Best,
> > > >> Godfrey
> > > >>
> > > >> Ingo Bürk  于2022年6月13日周一 14:17写道:
> > > >> >
> > > >> > Hi Godfrey,
> > > >> >
> > > >> > thank you for the explanation. A SELECT is definitely more generic
> > and
> > > >> > will work for all connectors automatically. As such I think it's a
> > > good
> > > >> > baseline solution regardless.
> > > >> >
> > > >> > We can also think about allowing connector-specific optimizations in
> > > the
> > > >> > future, but I do like your idea of letting the optimizer rules
> > > perform a
> > > >> > lot of the work here already by leveraging existing optimizations.
> > > >> > Similarly things like non-null counts of non-nullable columns would
> > > (or
> > > >> > at least could) be handled by the optimizer rules already.
> > > >> >
> > > >> > So as far as that point goes, +1 to the generic approach.
> > > >> >
> > > >> > One more point, though: In general we should avoid supporting
> > features
> > > >> > only in specific modes as it breaks the unification promise. Given
> > > that
> > > >> > ANALYZE is a manual and completely optional operation I'm OK with
> > > doing
> > > >> > that here in principle. However, I wonder what will happen in the
> > > >> > streaming / unbounded case. Do you plan to throw an error? Or 

Re: [ANNOUNCE] New Apache Flink PMC Member - Jingsong Lee

2022-06-13 Thread godfrey he
Congratulations, Jingsong!

Best,
Godfrey

Shuo Cheng  于2022年6月13日周一 16:43写道:
>
> Congratulations, Jingsong!
>
> On 6/13/22, Paul Lam  wrote:
> > Congrats, Jingsong! Well deserved!
> >
> > Best,
> > Paul Lam
> >
> >> 2022年6月13日 16:31,Lincoln Lee  写道:
> >>
> >> Congratulations, Jingsong!
> >>
> >> Best,
> >> Lincoln Lee
> >>
> >>
> >> Jark Wu  于2022年6月13日周一 16:29写道:
> >>
> >>> Congrats, Jingsong!
> >>>
> >>> Cheers,
> >>> Jark
> >>>
> >>> On Mon, 13 Jun 2022 at 16:16, Jiangang Liu 
> >>> wrote:
> >>>
>  Congratulations, Jingsong!
> 
>  Best,
>  Jiangang Liu
> 
>  Martijn Visser  于2022年6月13日周一 16:06写道:
> 
> > Like everyone has mentioned, this is very well deserved.
> >>> Congratulations!
> >
> > Op ma 13 jun. 2022 om 09:57 schreef Benchao Li :
> >
> >> Congratulations, Jingsong!  Well deserved.
> >>
> >> Rui Fan <1996fan...@gmail.com> 于2022年6月13日周一 15:53写道:
> >>
> >>> Congratulations, Jingsong!
> >>>
> >>> Best,
> >>> Rui Fan
> >>>
> >>> On Mon, Jun 13, 2022 at 3:40 PM LuNing Wang  
> >> wrote:
> >>>
>  Congratulations, Jingsong!
> 
>  Best,
>  LuNing Wang
> 
>  Ingo Bürk  于2022年6月13日周一 15:36写道:
> 
> > Congrats, Jingsong!
> >
> > On 13.06.22 08:58, Becket Qin wrote:
> >> Hi all,
> >>
> >> I'm very happy to announce that Jingsong Lee has joined the
>  Flink
> >>> PMC!
> >>
> >> Jingsong became a Flink committer in Feb 2020 and has been
> >>> continuously
> >> contributing to the project since then, mainly in Flink SQL.
> >>> He
> > has
>  been
> >> quite active in the mailing list, fixing bugs, helping
>  verifying
> > releases,
> >> reviewing patches and FLIPs. Jingsong is also devoted to
>  pushing
> >>> Flink
> > SQL
> >> to new use cases. He spent a lot of time in implementing the
> > Flink
> >> connectors for Apache Iceberg. Jingsong is also the primary
> > driver
>  behind
> >> the effort of flink-table-store, which aims to provide a
> >> stream-batch
> >> unified storage for Flink dynamic tables.
> >>
> >> Congratulations and welcome, Jingsong!
> >>
> >> Cheers,
> >>
> >> Jiangjie (Becket) Qin
> >> (On behalf of the Apache Flink PMC)
> >>
> >
> 
> >>>
> >>
> >>
> >> --
> >>
> >> Best,
> >> Benchao Li
> >>
> >
> 
> >>>
> >
> >


Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-12 Thread godfrey he
Hi Ingo,

The semantics do not distinguish between batch and streaming:
the statement works in both modes, but the result of
unbounded sources is meaningless.
Currently, an exception is thrown in streaming mode,
and we can support streaming mode with bounded sources
in the future.

Best,
Godfrey

Ingo Bürk  于2022年6月13日周一 14:17写道:
>
> Hi Godfrey,
>
> thank you for the explanation. A SELECT is definitely more generic and
> will work for all connectors automatically. As such I think it's a good
> baseline solution regardless.
>
> We can also think about allowing connector-specific optimizations in the
> future, but I do like your idea of letting the optimizer rules perform a
> lot of the work here already by leveraging existing optimizations.
> Similarly things like non-null counts of non-nullable columns would (or
> at least could) be handled by the optimizer rules already.
>
> So as far as that point goes, +1 to the generic approach.
>
> One more point, though: In general we should avoid supporting features
> only in specific modes as it breaks the unification promise. Given that
> ANALYZE is a manual and completely optional operation I'm OK with doing
> that here in principle. However, I wonder what will happen in the
> streaming / unbounded case. Do you plan to throw an error? Or do we
> complete the command as successful but without doing anything?
>
>
> Best
> Ingo
>
> On 13.06.22 05:50, godfrey he wrote:
> > Hi Ingo,
> >
> > Thanks for the inputs.
> >
> > I think converting `ANALYZE TABLE` to a `SELECT` statement is
> > the more generic approach. Because query plan optimization is generic,
> > we can provide optimization rules that improve not only the `SELECT`
> > statements converted from `ANALYZE TABLE` but also the `SELECT`
> > statements written by users.
> >
> >> JDBC connector can get a row count estimate without performing a
> >> SELECT COUNT(1)
> > To optimize such cases, we can implement a rule that pushes the aggregate
> > into the table source.
> > Currently, there is a similar ability interface, SupportsAggregatePushDown,
> > which only supports pushing the local aggregate into the source.
> >
> >
> > Best,
> > Godfrey
> >
> > Ingo Bürk  于2022年6月10日周五 17:15写道:
> >>
> >> Hi Godfrey,
> >>
> >> compared to the solution proposed in the FLIP (using a SELECT
> >> statement), I wonder if you have considered adding APIs to catalogs /
> >> connectors to perform this task as an alternative?
> >> I could imagine that for many connectors, statistics could be
> >> implemented in a less expensive way by leveraging the underlying system
> >> (e.g. a JDBC connector can get a row count estimate without performing a
> >> SELECT COUNT(1)).
> >>
> >>
> >> Best
> >> Ingo
> >>
> >>
> >> On 10.06.22 09:53, godfrey he wrote:
> >>> Hi all,
> >>>
> >>> I would like to open a discussion on FLIP-240:  Introduce "ANALYZE
> >>> TABLE" Syntax.
> >>>
> >>> As FLIP-231 mentioned, statistics are one of the most important inputs
> >>> to the optimizer. Accurate and complete statistics allow the
> >>> optimizer to be more powerful. "ANALYZE TABLE" syntax is a common
> >>> and effective approach to gather statistics, and is already
> >>> provided by many compute engines and databases.
> >>>
> >>> The main purpose of this discussion is to introduce "ANALYZE TABLE"
> >>> syntax for Flink SQL.
> >>>
> >>> You can find more details in FLIP-240 document[1]. Looking forward to
> >>> your feedback.
> >>>
> >>> [1] 
> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
> >>> [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240
> >>>
> >>>
> >>> Best,
> >>> Godfrey


Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-12 Thread godfrey he
Hi Ingo,

Thanks for the inputs.

I think converting `ANALYZE TABLE` to a `SELECT` statement is
the more generic approach. Because query plan optimization is generic,
we can provide optimization rules that improve not only the `SELECT`
statements converted from `ANALYZE TABLE` but also the `SELECT`
statements written by users.

> JDBC connector can get a row count estimate without performing a
> SELECT COUNT(1)
To optimize such cases, we can implement a rule that pushes the aggregate
into the table source.
Currently, there is a similar ability interface, SupportsAggregatePushDown,
which only supports pushing the local aggregate into the source.
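
As a concrete sketch (using a hypothetical table `MyTable` with a column `a`),
this is the kind of statistics-gathering query that `ANALYZE TABLE` could be
translated to:

    SELECT COUNT(1) AS row_count,
           COUNT(a)  AS non_null_count_a,
           MAX(a)    AS max_a,
           MIN(a)    AS min_a
    FROM MyTable;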


Best,
Godfrey

Ingo Bürk  于2022年6月10日周五 17:15写道:
>
> Hi Godfrey,
>
> compared to the solution proposed in the FLIP (using a SELECT
> statement), I wonder if you have considered adding APIs to catalogs /
> connectors to perform this task as an alternative?
> I could imagine that for many connectors, statistics could be
> implemented in a less expensive way by leveraging the underlying system
> (e.g. a JDBC connector can get a row count estimate without performing a
> SELECT COUNT(1)).
>
>
> Best
> Ingo
>
>
> On 10.06.22 09:53, godfrey he wrote:
> > Hi all,
> >
> > I would like to open a discussion on FLIP-240:  Introduce "ANALYZE
> > TABLE" Syntax.
> >
> > As FLIP-231 mentioned, statistics are one of the most important inputs
> > to the optimizer. Accurate and complete statistics allow the
> > optimizer to be more powerful. "ANALYZE TABLE" syntax is a common
> > and effective approach to gather statistics, and is already
> > provided by many compute engines and databases.
> >
> > The main purpose of this discussion is to introduce "ANALYZE TABLE"
> > syntax for Flink SQL.
> >
> > You can find more details in FLIP-240 document[1]. Looking forward to
> > your feedback.
> >
> > [1] 
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
> > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240
> >
> >
> > Best,
> > Godfrey


Re: [DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-12 Thread godfrey he
Hi cao,

Thanks for the feedback.
AFAIK, unlike databases, many big data compute engines do not collect
statistics automatically when writing data.
FLIP-231[1] has introduced the SupportsStatisticReport interface, through
which the planner collects statistics from the connector when the statistics
from the catalog are unknown.
But the statistics from the connector usually contain only partial
information; typically, the number of distinct values is not included.
`ANALYZE TABLE` provides a way to update the complete statistics manually,
which is also offered by many big data compute engines and databases.
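
A sketch of the proposed usage (the partition columns ds/hr are taken from the
example below; the exact grammar is defined in the FLIP and may still change):

    ANALYZE TABLE MyTable PARTITION (ds='2022-06-01', hr=1)
    COMPUTE STATISTICS FOR COLUMNS a, b;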

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-231%3A+Introduce+SupportsStatisticReport+to+support+reporting+statistics+from+source+connectors

Best,
Godfrey

cao zou  于2022年6月10日周五 16:49写道:
>
> Hi godfrey, Thanks for driving this meaningful topic.
> I think statistics are essential and meaningful for the optimizer; I'm just
> wondering in which situations it is needed. From the user's side, the
> optimizer is run by the framework, and users may not want to think too much
> about it. Could you share more situations about using 'ANALYZE TABLE'
> from the user's side?
>
> nit: There may be a mistake in Examples#partition table;
> the partition info should be
>
> Partition1: (ds='2022-06-01', hr=1)
>
> Partition2: (ds='2022-06-01', hr=2)
>
> Partition3: (ds='2022-06-02', hr=1)
>
> Partition4: (ds='2022-06-02', hr=2)
>
> best
>  zoucao
>
>
> godfrey he  于2022年6月10日周五 15:54写道:
>
> > Hi all,
> >
> > I would like to open a discussion on FLIP-240:  Introduce "ANALYZE
> > TABLE" Syntax.
> >
> > > As FLIP-231 mentioned, statistics are one of the most important inputs
> > > to the optimizer. Accurate and complete statistics allow the
> > > optimizer to be more powerful. "ANALYZE TABLE" syntax is a common
> > > and effective approach to gather statistics, and is already
> > > provided by many compute engines and databases.
> > >
> > > The main purpose of this discussion is to introduce "ANALYZE TABLE"
> > > syntax for Flink SQL.
> >
> > You can find more details in FLIP-240 document[1]. Looking forward to
> > your feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
> > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240
> >
> >
> > Best,
> > Godfrey
> >


[DISCUSS] FLIP-240: Introduce "ANALYZE TABLE" Syntax

2022-06-10 Thread godfrey he
Hi all,

I would like to open a discussion on FLIP-240:  Introduce "ANALYZE
TABLE" Syntax.

As FLIP-231 mentioned, statistics are one of the most important inputs
to the optimizer. Accurate and complete statistics allow the
optimizer to be more powerful. "ANALYZE TABLE" syntax is a common
and effective approach to gather statistics, and is already
provided by many compute engines and databases.

The main purpose of this discussion is to introduce "ANALYZE TABLE"
syntax for Flink SQL.
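
As a quick illustration of the proposed statement (a sketch; see the FLIP
document for the full grammar, which may still change during this discussion):

    -- collect table-level statistics, e.g. row count
    ANALYZE TABLE MyTable COMPUTE STATISTICS;

    -- additionally collect column-level statistics
    ANALYZE TABLE MyTable COMPUTE STATISTICS FOR ALL COLUMNS;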

You can find more details in FLIP-240 document[1]. Looking forward to
your feedback.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
[2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240


Best,
Godfrey


[jira] [Created] (FLINK-27990) Parquet format supports reporting statistics

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27990:
--

 Summary: Parquet format supports reporting statistics
 Key: FLINK-27990
 URL: https://issues.apache.org/jira/browse/FLINK-27990
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Affects Versions: 1.16.0
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27991) ORC format supports reporting statistics

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27991:
--

 Summary: ORC format supports reporting statistics
 Key: FLINK-27991
 URL: https://issues.apache.org/jira/browse/FLINK-27991
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27989) CSV format supports reporting statistics

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27989:
--

 Summary: CSV format supports reporting statistics
 Key: FLINK-27989
 URL: https://issues.apache.org/jira/browse/FLINK-27989
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27988) Let HiveTableSource extend from SupportsStatisticReport

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27988:
--

 Summary: Let HiveTableSource extend from SupportsStatisticReport
 Key: FLINK-27988
 URL: https://issues.apache.org/jira/browse/FLINK-27988
 Project: Flink
  Issue Type: Sub-task
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27987) Let FileSystemTableSource extend from SupportsStatisticReport

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27987:
--

 Summary: Let FileSystemTableSource extend from 
SupportsStatisticReport
 Key: FLINK-27987
 URL: https://issues.apache.org/jira/browse/FLINK-27987
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27985) Introduce FlinkRecomputeStatisticsProgram to compute statistics after filter push and partition pruning

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27985:
--

 Summary: Introduce FlinkRecomputeStatisticsProgram to compute 
statistics after filter push and partition pruning
 Key: FLINK-27985
 URL: https://issues.apache.org/jira/browse/FLINK-27985
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27984) Introduce FileBasedStatisticsReportableDecodingFormat interface

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27984:
--

 Summary: Introduce FileBasedStatisticsReportableDecodingFormat 
interface
 Key: FLINK-27984
 URL: https://issues.apache.org/jira/browse/FLINK-27984
 Project: Flink
  Issue Type: Sub-task
Reporter: godfrey he






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27983) Introduce SupportsStatisticsReport interface

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27983:
--

 Summary: Introduce SupportsStatisticsReport interface
 Key: FLINK-27983
 URL: https://issues.apache.org/jira/browse/FLINK-27983
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27982) FLIP-231: Introduce SupportsStatisticReport to support reporting statistics from source connectors

2022-06-09 Thread godfrey he (Jira)
godfrey he created FLINK-27982:
--

 Summary: FLIP-231: Introduce SupportsStatisticReport to support 
reporting statistics from source connectors
 Key: FLINK-27982
 URL: https://issues.apache.org/jira/browse/FLINK-27982
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Planner
Reporter: godfrey he
 Fix For: 1.16.0


https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[RESULT][VOTE] FLIP-231: Introduce SupportsStatisticReport to support reporting statistics from source connectors

2022-06-09 Thread godfrey he
Hi, everyone.

FLIP-231: Introduce SupportsStatisticReport to support reporting
statistics from source connectors[1] has been accepted.

There are 5 binding votes and 1 non-binding vote[2].
- Jing Ge(non-binding)
- Jark Wu(binding)
- Jingsong Li(binding)
- Martijn Visser(binding)
- Jing Zhang(binding)
- Leonard Xu(binding)

None against.

Thanks again to everyone who participated in this FLIP.


[1]
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00
[2] https://lists.apache.org/thread/j1mqblpbp60hgwg2fnhp44cktfp76zd2


Best,
Godfrey


Re: [DISCUSS] FLIP-222: Support full query lifecycle statements in SQL client

2022-06-08 Thread godfrey he
>>>> Speaking of ETL DAG, we might want to see the lineage. Is it possible to
>>>> support syntax like:
>>>>
>>>> SHOW JOBTREE   // shows the downstream DAG from the given job_id
>>>> SHOW JOBTREE  FULL // shows the whole DAG that contains the given 
>>>> job_id
>>>> SHOW JOBTREES // shows all DAGs
>>>> SHOW ANCESTORS  // shows all parents of the given job_id
>>>>
>>>> 3)
>>>> Could we also support Savepoint housekeeping syntax? We ran into this 
>>>> issue that a lot of savepoints have been created by customers (via their 
>>>> apps). It will take extra (hacking) effort to clean it.
>>>>
>>>> RELEASE SAVEPOINT ALL
>>>>
>>>> Best regards,
>>>> Jing
>>>>
>>>> On Tue, Jun 7, 2022 at 2:35 PM Martijn Visser  
>>>> wrote:
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> I'm still doubting the keyword for the SQL applications. SHOW QUERIES 
>>>>> could
>>>>> imply that this will actually show the query, but we're returning IDs of
>>>>> the running application. At first I was also not very much in favour of
>>>>> SHOW JOBS since I prefer calling it 'Flink applications' and not 'Flink
>>>>> jobs', but the glossary [1] made me reconsider. I would +1 SHOW/STOP JOBS
>>>>>
>>>>> Also +1 for the CREATE/SHOW/DROP SAVEPOINT syntax.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Martijn
>>>>>
>>>>> [1]
>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/glossary
>>>>>
>>>>> Op za 4 jun. 2022 om 10:38 schreef Paul Lam :
>>>>>
>>>>> > Hi Godfrey,
>>>>> >
>>>>> > Sorry for the late reply, I was on vacation.
>>>>> >
>>>>> > It looks like we have a variety of preferences on the syntax, how about 
>>>>> > we
>>>>> > choose the most acceptable one?
>>>>> >
>>>>> > WRT keyword for SQL jobs, we use JOBS, thus the statements related to 
>>>>> > jobs
>>>>> > would be:
>>>>> >
>>>>> > - SHOW JOBS
>>>>> > - STOP JOBS  (with options `table.job.stop-with-savepoint` and
>>>>> > `table.job.stop-with-drain`)
>>>>> >
>>>>> > WRT savepoint for SQL jobs, we use the `CREATE/DROP` pattern with `FOR
>>>>> > JOB`:
>>>>> >
>>>>> > - CREATE SAVEPOINT  FOR JOB 
>>>>> > - SHOW SAVEPOINTS FOR JOB  (show savepoints the current job
>>>>> > manager remembers)
>>>>> > - DROP SAVEPOINT 
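>>>>> >
>>>>> > For example, a usage sketch of the proposed statements (the job id and
>>>>> > savepoint path below are hypothetical):
>>>>> >
>>>>> >   SHOW JOBS;
>>>>> >   CREATE SAVEPOINT '/tmp/sp-1' FOR JOB '228d7091';
>>>>> >   SHOW SAVEPOINTS FOR JOB '228d7091';
>>>>> >   DROP SAVEPOINT '/tmp/sp-1';
>>>>> >   STOP JOBS '228d7091';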
>>>>> >
>>>>> > cc @Jark @ShengKai @Martijn @Timo .
>>>>> >
>>>>> > Best,
>>>>> > Paul Lam
>>>>> >
>>>>> >
>>>>> > godfrey he  于2022年5月23日周一 21:34写道:
>>>>> >
>>>>> >> Hi Paul,
>>>>> >>
>>>>> >> Thanks for the update.
>>>>> >>
>>>>> >> >'SHOW QUERIES' lists all jobs in the cluster, no limit on APIs
>>>>> >> (DataStream or SQL) or
>>>>> >> clients (SQL client or CLI).
>>>>> >>
>>>>> >> Is a DataStream job a QUERY? I think not.
>>>>> >> For a QUERY, the most important concept is the statement. But the
>>>>> >> result does not contain this info.
>>>>> >> If we need to contain all jobs in the cluster, I think the name should
>>>>> >> be JOB or PIPELINE.
>>>>> >> I lean toward SHOW PIPELINES and STOP PIPELINE [IF RUNNING] id.
>>>>> >>
>>>>> >> > SHOW SAVEPOINTS
>>>>> >> To list the savepoint for a specific job, we need to specify a
>>>>> >> specific pipeline,
>>>>> >> the syntax should be SHOW SAVEPOINTS FOR PIPELINE id
>>>>> >>
>>>>> >> Best,
>>>>> >> Godfrey
>>>>> >>
>>>>> >> Paul Lam  于2022年5月20日周五 11:25写道:
>>>>> >> >
>>>>> >> > Hi Jark,
>>>>> >> >
>>>&

Re: [VOTE] FLIP-223: Support HiveServer2 Endpoint

2022-06-08 Thread godfrey he
+1

Best,
Godfrey

Jark Wu  于2022年6月7日周二 17:21写道:
>
> +1 (binding)
>
> Best,
> Jark
>
> On Tue, 7 Jun 2022 at 13:32, Shengkai Fang  wrote:
>
> > Hi, everyone.
> >
> > Thanks for all feedback for FLIP-223: Support HiveServer2 Endpoint[1] on
> > the discussion thread[2]. I'd like to start a vote for it. The vote will be
> > open for at least 72 hours unless there is an objection or not enough
> > votes.
> >
> > Best,
> > Shengkai
> >
> >
> > [1]
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-223%3A+Support+HiveServer2+Endpoint
> > [2] https://lists.apache.org/thread/9r1j7ho2m8zbqy3tl7vvj9gnocggwr6x
> >


Re: [VOTE] FLIP-234: Support Retryable Lookup Join To Solve Delayed Updates Issue In External Systems

2022-06-08 Thread godfrey he
+1

Best,
Godfrey

Jingsong Li  于2022年6月9日周四 10:26写道:
>
> +1 (binding)
>
> Best,
> Jingsong
>
> On Tue, Jun 7, 2022 at 5:21 PM Jark Wu  wrote:
> >
> > +1 (binding)
> >
> > Best,
> > Jark
> >
> > On Tue, 7 Jun 2022 at 12:17, Lincoln Lee  wrote:
> >
> > > Dear Flink developers,
> > >
> > > Thanks for all your feedback for FLIP-234: Support Retryable Lookup Join 
> > > To
> > > Solve Delayed Updates Issue In External Systems[1] on the discussion
> > > thread[2].
> > >
> > > I'd like to start a vote for it. The vote will be open for at least 72
> > > hours unless there is an objection or not enough votes.
> > >
> > > [1]
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-234%3A+Support+Retryable+Lookup+Join+To+Solve+Delayed+Updates+Issue+In+External+Systems
> > > [2] https://lists.apache.org/thread/9k1sl2519kh2n3yttwqc00p07xdfns3h
> > >
> > > Best,
> > > Lincoln Lee
> > >


[VOTE] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-06-06 Thread godfrey he
Hi everyone,

Thanks for all the feedback so far. Based on the discussion[1] we seem
to have consensus, so I would like to start a vote on FLIP-231 for
which the FLIP has now also been updated[2].

The vote will last for at least 72 hours (Jun 10th 12:00 GMT) unless
there is an objection or insufficient votes.

[1] https://lists.apache.org/thread/88kxk7lh8bq2s2c2qrf06f3pnf9fkxj2
[2] 
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00

Best,
Godfrey


Re: [DISCUSS] FLIP-223: Support HiveServer2 Endpoint

2022-06-06 Thread godfrey he
Hi, Shengkai.

Thanks for the update, LGTM now.

Best,
Godfrey


Shengkai Fang  于2022年6月6日周一 16:47写道:
>
> Hi. Godfrey.
>
> Nice to hear the comments from you.
>
> > Could you give an overall architecture of the ecosystem around HiveServer2
> > and the SqlGateway, including the JDBC driver, Beeline, etc.?
> > That would be clearer for users.
>
> Yes. I have updated the FLIP and added the architecture of the Gateway with
> the HiveServer2 endpoint.
>
> > How To Use
> >> Could you give a complete example to describe an end-to-end case?
>
> Yes. I have updated the FLIP. The beeline users can just use the connect
> command to connect to the SQLGateway with the HiveServer2 endpoint.
> For example, users just inputs "!connect
> jdbc:hive2://:/;auth=noSasl
> hiveuser pass" into the terminal to connect to the SQLGateway.
>
> > Is the streaming SQL supported? What's the behavior if I submit a
> streaming query or I change the dialect to 'default'?
> Yes. We don't limit the usage here. Users can switch to the streaming mode
> or use the default dialect.  But we don't suggest users use the hive
> dialect in the streaming mode. As far as I know, it has some problems that
> are not fixed yet, e.g. you may get errors for SQL that works in the batch
> mode. I added a section to mention this.
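>
> For illustration, a sketch using existing configuration options:
>
>     SET 'execution.runtime-mode' = 'batch';   -- or 'streaming'
>     SET 'table.sql-dialect' = 'hive';         -- or 'default'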
>
> > Considering the different users may have different requirements to
> connect to different meta stores,
> > they can use the DDL to register the HiveCatalog that satisfies their
> requirements.
> >> Could you give some examples to explain it more?
>
> Hive supports setting multiple metastore addresses via the config option
> "hive.metastore.urls". Here I just mean users can switch to connect to
> different metastore instances using the CREATE CATALOG DDL. I updated the
> FLIP to make it more clear.
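>
> For example, a sketch of such a DDL (the catalog name and conf directory
> are hypothetical):
>
>     CREATE CATALOG my_hive WITH (
>       'type' = 'hive',
>       'hive-conf-dir' = '/opt/hive-conf'
>     );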
>
> Best,
> Shengkai
>
> godfrey he  于2022年6月6日周一 13:45写道:
>
> > Hi Shengkai,
> >
> > Thanks for driving this.
> >
> > I have a few comments:
> >
> > Could you give an overall architecture of the ecosystem around HiveServer2
> > and the SqlGateway, including the JDBC driver, Beeline, etc.?
> > That would be clearer for users.
> >
> > > Considering the different users may have different requirements to
> > connect to different meta stores,
> > > they can use the DDL to register the HiveCatalog that satisfies their
> > requirements.
> >  Could you give some examples to explain it more?
> >
> > > How To Use
> > Could you give a complete example describing an end-to-end case?
> >
> > Is streaming SQL supported? What's the behavior if I submit a streaming
> > query
> > or change the dialect to 'default'?
> >
> > Best,
> > Godfrey
> >
> > Shengkai Fang  于2022年6月1日周三 21:13写道:
> > >
> > > Hi, Jingsong.
> > >
> > > Thanks for your feedback.
> > >
> > > > I've read the FLIP and it's not quite clear what the specific
> > unsupported
> > > items are
> > >
> > > Yes. I have added a section named Difference with HiveServer2 and list
> > the
> > > difference between the SQL Gateway with HiveServer2 endpoint and
> > > HiveServer2.
> > >
> > > > Support multiple metastore clients in one gateway?
> > >
> > > Yes. It may cause class conflicts when using the different versions of
> > Hive
> > > Catalog at the same time. I add a section named "How to use" to remind
> > the
> > > users don't use HiveCatalog with different versions together.
> > >
> > > >  Hive versions and setup
> > >
> > > Considering the HiveServer2 endpoint binds to the HiveCatalog, we will
> > not
> > > introduce a new module about the HiveServer2 endpoint. The current
> > > dependencies in the hive connector should be enough for the HiveServer2
> > > Endpoint except for the hive-service-RPC(it contains the HiveServer2
> > > interface). In this way, the hive connector jar will contain an
> > endpoint. I
> > > add a section named "Merge HiveServer2 Endpoint into Hive Connector
> > > Module".
> > >
> > > For usage, the user can just add the hive connector jar into the
> > classpath
> > > and use the sql-gateway.sh to start the SQL Gateway with the hiveserver2
> > > endpoint.  You can refer to the section "How to use" for more details.
> > >
> > > Best,
> > > Shengkai
> > >
> > > Jingsong Li  于2022年6月1日周三 15:04

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-06-06 Thread godfrey he
Hi, everyone.

Thanks for all the inputs.
Since there is no further feedback, I will start the vote tomorrow.

Best,
Godfrey

godfrey he  于2022年6月1日周三 13:30写道:
>
> Hi, Jing.
>
> Thanks for the suggestion. I have updated the doc and
> will continue to optimize the code in a subsequent PR.
>
> Best,
> Godfrey
>
> Jing Ge  于2022年6月1日周三 04:41写道:
> >
> > Hi Godfrey,
> >
> > Thanks for clarifying it. I personally prefer the new change you suggested.
> >
> > Would you please help to understand one more thing? The else if
> > (filterPushDownSpec != null) branch is the only branch that doesn't have to
> > check if the newTableStat has been calculated previously. The reason might
> > be, after the filter has been pushed down to the table source,
> > ((SupportStatisticReport) tableSource).reportStatistics() will return a new
> > TableStats, which turns out the newTableStat has to be re-computed for each
> > filter push down. In this case, it might be good to add it into the FLIP
> > description. Otherwise, We could also add optimization for this branch to
> > avoid re-computing the table statistics.
> >
> > NIT: There are many conditions in the nested if-else statements. In order
> > to improve the readability (and maintainability in the future), we could
> > consider moving up some common checks like collectStatEnabled or
> > tableSource instanceof SupportStatisticReport, e.g.:
> >
> >  private LogicalTableScan collectStatistics(LogicalTableScan scan) {
> >   ..
> >  FlinkStatistic newStatistic = FlinkStatistic.builder()
> > .statistic(table.getStatistic()) .tableStats(refreshTableStat(...))
> > .build();
> >  return new LogicalTableScan( scan.getCluster(), scan.getTraitSet(),
> > scan.getHints(), table.copy(newStatistic));
> > }
> >
> > private TableStats refreshTableStat(boolean collectStatEnabled,
> > TableSourceTable table , PartitionPushDownSpec partitionPushDownSpec,
> > FilterPushDownSpec filterPushDownSpec) {
> >
> >  if (!collectStatEnabled)   return null;
> >
> >  if (!(table.tableSource() instanceof SupportStatisticReport)) return
> > null;
> >
> >  SupportStatisticReport tableSource =
> > (SupportStatisticReport)table.tableSource();
> >
> >   if (filterPushDownSpec != null) {
> >  // filter push down, no matter  partition push down or not
> >  return  tableSource.reportStatistics();
> >  } else {
> >  if (partitionPushDownSpec != null) {
> > // partition push down, while filter not
> >  } else {
> >  // no partition and filter push down
> >  return table.getStatistic().getTableStats() ==
> > TableStats.UNKNOWN ? tableSource.reportStatistics() :
> > table.getStatistic().getTableStats();
> >     }
> >  }
> > }
> >
> > This just improved a little bit without introducing some kind of
> > Action/Performer interface with many subclasses and factory class to get
> > rid of some if-else statements, which could optionally be the next step
> > provement in the future.
> >
> > Best regards,
> > Jing
> >
> > On Tue, May 31, 2022 at 3:42 PM godfrey he  wrote:
> >
> > > Hi Jark and Jing,
> > >
> > > +1 to use "report" instead of "collect".
> > >
> > > >  // only filter push down (*the description means
> > > > partitionPushDownSpec == null but misses the case of
> > > > partitionPushDownSpec != null*)
> > >
> > > `if (partitionPushDownSpec != null && filterPushDownSpec == null)`:
> > > this branch only covers the case where a partition is pushed
> > > down,
> > > but no filter is pushed down. The planner will collect the statistics
> > > from the catalog first,
> > > and then try to collect the statistics from the connector if the catalog
> > > statistics are unknown.
> > >
> > > `else if (filterPushDownSpec != null)`: in this branch, whether
> > > partitionPushDownSpec is null or not, the planner will collect
> > > statistics from
> > > the connector only, because the catalog does not support getting
> > > statistics with filters.
> > >
> > > `else if (collectStatEnabled
> > > && (table.getStatistic().getTableStats() ==
> > > TableStats.UNKNOWN)
> > > && tableSource instanceof SupportStatisticReport)`
> > > this branch means no partition and no filter are pushed down.

Re: [DISCUSS] FLIP-223: Support HiveServer2 Endpoint

2022-06-05 Thread godfrey he
Hi Shengkai,

Thanks for driving this.

I have a few comments:

Could you give an overall architecture of the ecosystem around HiveServer2
and the SqlGateway, including the JDBC driver, Beeline, etc.?
That would be clearer for users.

> Considering the different users may have different requirements to connect to 
> different meta stores,
> they can use the DDL to register the HiveCatalog that satisfies their 
> requirements.
 Could you give some examples to explain this further?

> How To Use
Could you give a complete example describing an end-to-end case?

Is streaming SQL supported? What's the behavior if I submit a streaming query
or change the dialect to 'default'?

Best,
Godfrey

Shengkai Fang  于2022年6月1日周三 21:13写道:
>
> Hi, Jingsong.
>
> Thanks for your feedback.
>
> > I've read the FLIP and it's not quite clear what the specific unsupported
> items are
>
> Yes. I have added a section named Difference with HiveServer2 and list the
> difference between the SQL Gateway with HiveServer2 endpoint and
> HiveServer2.
>
> > Support multiple metastore clients in one gateway?
>
> Yes. It may cause class conflicts when using the different versions of Hive
> Catalog at the same time. I add a section named "How to use" to remind the
> users don't use HiveCatalog with different versions together.
>
> >  Hive versions and setup
>
> Considering the HiveServer2 endpoint binds to the HiveCatalog, we will not
> introduce a new module about the HiveServer2 endpoint. The current
> dependencies in the hive connector should be enough for the HiveServer2
> Endpoint except for the hive-service-RPC(it contains the HiveServer2
> interface). In this way, the hive connector jar will contain an endpoint. I
> add a section named "Merge HiveServer2 Endpoint into Hive Connector
> Module".
>
> For usage, the user can just add the hive connector jar into the classpath
> and use the sql-gateway.sh to start the SQL Gateway with the hiveserver2
> endpoint.  You can refer to the section "How to use" for more details.
>
> Best,
> Shengkai
>
> Jingsong Li  于2022年6月1日周三 15:04写道:
>
> > Hi Shengkai,
> >
> > Thanks for driving.
> >
> > I have a few comments:
> >
> > ## Unsupported features
> >
> > I've read the FLIP and it's not quite clear what the specific unsupported
> > items are?
> > - For example, security related, is it not supported.
> > - For example, is there a loss of precision for types
> > - For example, the FetchResults are not the same
> >
> > ## Support multiple metastore clients in one gateway?
> >
> > > During the setup, the HiveServer2 tires to load the config in the
> > hive-site.xml to initialize the Hive metastore client. In the Flink, we use
> > the Catalog interface to connect to the Hive Metastore, which is allowed to
> > communicate with different Hive Metastore[1]. Therefore, we allows the user
> > to specify the path of the hive-site.xml as the endpoint parameters, which
> > will used to create the default HiveCatalog in the Flink. Considering the
> > different users may have different requirements to connect to different
> > meta stores, they can use the DDL to register the HiveCatalog that
> > satisfies their requirements.
> >
> > I understand it is difficult. You really want to support?
> >
> > ## Hive versions and setup
> >
> > I saw jark also commented, but FLIP does not seem to have been modified,
> > how should the user setup, which jar to add, which hive metastore version
> > to support? How to setup to support?
> >
> > Best,
> > Jingsong
> >
> > On Tue, May 24, 2022 at 11:57 AM Shengkai Fang  wrote:
> >
> > > Hi, all.
> > >
> > > Considering we start to vote for FLIP-91 for a while, I think we can
> > > restart the discussion about the FLIP-223.
> > >
> > > I am glad that you can give some feedback about FLIP-223.
> > >
> > > Best,
> > > Shengkai
> > >
> > >
> > > Martijn Visser  于2022年5月6日周五 19:10写道:
> > >
> > > > Hi Shengkai,
> > > >
> > > > Thanks for clarifying.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > On Fri, 6 May 2022 at 08:40, Shengkai Fang  wrote:
> > > >
> > > > > Hi Martijn.
> > > > >
> > > > > > So this implementation would not rely in any way on Hive, only on
> > > > Thrift?
> > > > >
> > > > > Yes.  The dependency is light. We also can just copy the iface file
> > > from
> > > > > the Hive repo and maintain by ourselves.
> > > > >
> > > > > Best,
> > > > > Shengkai
> > > > >
> > > > > Martijn Visser  于2022年5月4日周三 21:44写道:
> > > > >
> > > > > > Hi Shengkai,
> > > > > >
> > > > > > > Actually we will only rely on the API in the Hive, which only
> > > > contains
> > > > > > the thrift file and the generated code
> > > > > >
> > > > > > So this implementation would not rely in any way on Hive, only on
> > > > Thrift?
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Martijn Visser
> > > > > > https://twitter.com/MartijnVisser82
> > > > > > https://github.com/MartijnVisser
> > > > > >
> > > > > >
> > > > > > On Fri, 29 Apr 2022 at 05:16, Shengkai Fang 
> > > wrote:
> > > > > 

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-05-31 Thread godfrey he
Hi, Jing.

Thanks for the suggestion. I have updated the doc and
will continue to optimize the code in a subsequent PR.

Best,
Godfrey

Jing Ge  于2022年6月1日周三 04:41写道:
>
> Hi Godfrey,
>
> Thanks for clarifying it. I personally prefer the new change you suggested.
>
> Would you please help to understand one more thing? The else if
> (filterPushDownSpec != null) branch is the only branch that doesn't have to
> check if the newTableStat has been calculated previously. The reason might
> be, after the filter has been pushed down to the table source,
> ((SupportStatisticReport) tableSource).reportStatistics() will return a new
> TableStats, which turns out the newTableStat has to be re-computed for each
> filter push down. In this case, it might be good to add it into the FLIP
> description. Otherwise, We could also add optimization for this branch to
> avoid re-computing the table statistics.
>
> NIT: There are many conditions in the nested if-else statements. In order
> to improve the readability (and maintainability in the future), we could
> consider moving up some common checks like collectStatEnabled or
> tableSource instanceof SupportStatisticReport, e.g.:
>
>  private LogicalTableScan collectStatistics(LogicalTableScan scan) {
>   ..
>  FlinkStatistic newStatistic = FlinkStatistic.builder()
> .statistic(table.getStatistic()) .tableStats(refreshTableStat(...))
> .build();
>  return new LogicalTableScan( scan.getCluster(), scan.getTraitSet(),
> scan.getHints(), table.copy(newStatistic));
> }
>
> private TableStats refreshTableStat(boolean collectStatEnabled,
> TableSourceTable table , PartitionPushDownSpec partitionPushDownSpec,
> FilterPushDownSpec filterPushDownSpec) {
>
>  if (!collectStatEnabled)   return null;
>
>  if (!(table.tableSource() instanceof SupportStatisticReport)) return
> null;
>
>  SupportStatisticReport tableSource =
> (SupportStatisticReport)table.tableSource();
>
>   if (filterPushDownSpec != null) {
>  // filter push down, no matter  partition push down or not
>  return  tableSource.reportStatistics();
>  } else {
>  if (partitionPushDownSpec != null) {
> // partition push down, while filter not
>  } else {
>  // no partition and filter push down
>  return table.getStatistic().getTableStats() ==
> TableStats.UNKNOWN ? tableSource.reportStatistics() :
> table.getStatistic().getTableStats();
> }
>  }
> }
>
> This just improved a little bit without introducing some kind of
> Action/Performer interface with many subclasses and factory class to get
> rid of some if-else statements, which could optionally be the next step
> provement in the future.
>
> Best regards,
> Jing
>
> On Tue, May 31, 2022 at 3:42 PM godfrey he  wrote:
>
> > Hi Jark and Jing,
> >
> > +1 to use "report" instead of "collect".
> >
> > >  // only filter push down (*the description means
> > > partitionPushDownSpec == null but misses the case of
> > > partitionPushDownSpec != null*)
> >
> > `if (partitionPushDownSpec != null && filterPushDownSpec == null)`:
> > this branch only covers the case where a partition is pushed
> > down,
> > but no filter is pushed down. The planner will collect the statistics
> > from the catalog first,
> > and then try to collect the statistics from the connector if the catalog
> > statistics are unknown.
> >
> > `else if (filterPushDownSpec != null)`: in this branch, whether
> > partitionPushDownSpec is null or not, the planner will collect
> > statistics from
> > the connector only, because the catalog does not support getting
> > statistics with filters.
> >
> > `else if (collectStatEnabled
> > && (table.getStatistic().getTableStats() ==
> > TableStats.UNKNOWN)
> > && tableSource instanceof SupportStatisticReport)`
> > this branch means no partition and no filter are pushed down.
> >
> > or we can change the pseudocode to:
> >  if (filterPushDownSpec != null) {
> >    // filter push down, no matter partition push down or not
> > } else {
> > if (partitionPushDownSpec != null) {
> > // partition push down, while filter not
> >  } else {
> >  // no partition and filter push down
> >  }
> > }
> >
> > Best,
> > Godfrey
> >
> > Jing Ge  于2022年5月29日周日 08:17写道:
> > >
> > > Hi Godfrey,
> > >
> > > Thanks for driving this FLIP.  It looks really good! Looking forward to
> > it!
&

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-05-31 Thread godfrey he
Hi Jark and Jing,

+1 to use "report" instead of "collect".

>  // only filter push down (*the description means
> partitionPushDownSpec == null but misses the case of
> partitionPushDownSpec != null*)

`if (partitionPushDownSpec != null && filterPushDownSpec == null)`:
this branch only covers the case where a partition is pushed down,
but no filter is pushed down. The planner will collect the statistics
from the catalog first,
and then try to collect the statistics from the connector if the catalog
statistics are unknown.

`else if (filterPushDownSpec != null)`: in this branch, whether
partitionPushDownSpec is null or not, the planner will collect
statistics from
the connector only, because the catalog does not support getting statistics with filters.

`else if (collectStatEnabled
&& (table.getStatistic().getTableStats() == TableStats.UNKNOWN)
&& tableSource instanceof SupportStatisticReport)`
this branch means no partition and no filter are pushed down.

or we can change the pseudocode to:
 if (filterPushDownSpec != null) {
   // filter push down, no matter partition push down or not
} else {
if (partitionPushDownSpec != null) {
// partition push down, while filter not
 } else {
 // no partition and filter push down
 }
}

Best,
Godfrey

Jing Ge  于2022年5月29日周日 08:17写道:
>
> Hi Godfrey,
>
> Thanks for driving this FLIP.  It looks really good! Looking forward to it!
>
> If I am not mistaken, partition pruning could also happen in the following
> pseudocode condition block:
>
> else if (filterPushDownSpec != null) {
> // only filter push down (*the description means
> partitionPushDownSpec == null but misses the case of
> partitionPushDownSpec != null*)
>
> // the catalog do not support get statistics with filters,
> // so only call reportStatistics method if needed
> if (collectStatEnabled && tableSource instanceof
> SupportStatisticReport) {
> newTableStat = ((SupportStatisticReport)
> tableSource).reportStatistics();
> }
>
>
> Best regards,
>
> Jing
>
>
> On Sat, May 28, 2022 at 5:09 PM Jark Wu  wrote:
>
> > Hi Godfrey,
> >
> > It seems that the "SupportStatisticReport" interface name and
> >  "table.optimizer.source.connect-statistics-enabled" option name is not
> > updated in the FLIP.
> >
> > Besides, in the terms of the option name, the meaning of
> > "source.statistics-type"
> > is not very straightforward and clean to me. Maybe
> > "source.report-statistics" = "none/all/file-size"
> > would be better.
> >
> > We can also change "table.optimizer.source.connect-statistics-enabled" to
> > "table.optimizer.source.report-statistics-enabled" for alignment and
> >  it's clear that one for fine-grained and one for coarse-grained.
> >
> >
> > Best,
> > Jark
> >
> > On Fri, 27 May 2022 at 22:58, godfrey he  wrote:
> >
> > > Hi, everyone.
> > >
> > > Thanks for all the inputs.
> > > If there is no more feedback, I think we can start the vote next Monday.
> > >
> > > Best,
> > > Godfrey
> > >
> > > Martijn Visser  于2022年5月25日周三 19:46写道:
> > > >
> > > > Hi Godfrey,
> > > >
> > > > Thanks for creating the FLIP. I have no comments.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > >
> > > > On Tue, 17 May 2022 at 04:52, Jingsong Li 
> > > wrote:
> > > >
> > > > > Hi Godfrey,
> > > > >
> > > > > Thanks for your reply.
> > > > >
> > > > > Sounds good to me.
> > > > >
> > > > > > I think we should also introduce a config option
> > > > >
> > > > > We can add this option to the FLIP. I prefer an option for
> > > > > FileSystemConnector, maybe an enum.
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he 
> > > wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Thanks for the feedback.
> > > > > >
> > > > > >
> > > > > > >One concern I have is that we read the footer for each file, and
> > > this
> > > > > may
> > > > > > >b

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-05-27 Thread godfrey he
Hi, everyone.

Thanks for all the inputs.
If there is no more feedback, I think we can start the vote next Monday.

Best,
Godfrey

Martijn Visser  于2022年5月25日周三 19:46写道:
>
> Hi Godfrey,
>
> Thanks for creating the FLIP. I have no comments.
>
> Best regards,
>
> Martijn
>
>
> On Tue, 17 May 2022 at 04:52, Jingsong Li  wrote:
>
> > Hi Godfrey,
> >
> > Thanks for your reply.
> >
> > Sounds good to me.
> >
> > > I think we should also introduce a config option
> >
> > We can add this option to the FLIP. I prefer an option for
> > FileSystemConnector, maybe an enum.
> >
> > Best,
> > Jingsong
> >
> > On Tue, May 17, 2022 at 10:31 AM godfrey he  wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for the feedback.
> > >
> > >
> > > >One concern I have is that we read the footer for each file, and this
> > may
> > > >be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way
> > > yes, if there are thousands of orc/parquet files, it may take a long
> > time.
> > > So we can introduce a config option to let the user choose the
> > > granularity of the statistics.
> > > But the SIZE will not be introduced, because the planner does not use
> > > the file size statistics now.
> > > We can introduce it once file-size statistics are introduced in the future.
> > > I think we should also introduce a config option to enable/disable
> > > SupportStatisticReport,
> > > because it's a heavy operation for some connectors in some cases.
> > >
> > > > is the filter pushdown already happening at
> > > > this time?
> > > That's a good point. Currently, the filter push down is after partition
> > > pruning
> > > to prevent the filter push down rule from consuming the partition
> > > predicates.
> > > The statistics will be set to unknown if filter is pushed down now.
> > > To combine them all, we can create an optimization program after filter
> > > push
> > > down program to collect the statistics. This could avoid collecting
> > > statistics multiple times.
> > >
> > >
> > > Best,
> > > Godfrey
> > >
> > > Jingsong Li  于2022年5月13日周五 22:44写道:
> > > >
> > > > Thank Godfrey for driving.
> > > >
> > > > Looks very good~ This will undoubtedly greatly enhance the various
> > batch
> > > > mode connectors.
> > > >
> > > > I left some comments:
> > > >
> > > > ## FileBasedStatisticsReportableDecodingFormat
> > > >
> > > > One concern I have is that we read the footer for each file, and this
> > may
> > > > be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way, e.g.
> > > > - No statistics are collected for files by default.
> > > > - SIZE: Generate statistics based on file Size, get the size of the
> > file
> > > > only with access to the master of the FileSystem.
> > > > - DETAILED: Get the complete statistics by format, possibly by
> > accessing
> > > > the footer of the file.
> > > >
> > > > ## When use the statistics reported by connector
> > > >
> > > > > When partitions are pruned by PushPartitionIntoTableSourceScanRule,
> > the
> > > > statistics should also be updated.
> > > >
> > > > I understand that we definitely need to use reporter after the
> > partition
> > > > prune, but another question: is the filter pushdown already happening
> > at
> > > > this time?
> > > > Can we make sure that in the following three cases, both the filter
> > > > pushdown and the partition prune happen before the stats reporting.
> > > > - only partition prune happens
> > > > - only filter pushdown happens
> > > > - both filter pushdown and partition prune happen
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Fri, May 13, 2022 at 6:57 PM godfrey he 
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > SupportStatisticReport
> > > > > to support reporting statistics from source connectors.
> > > > >
> > > > > Statistics are on

[jira] [Created] (FLINK-27817) TaskManager metaspace OOM for session cluster

2022-05-27 Thread godfrey he (Jira)
godfrey he created FLINK-27817:
--

 Summary: TaskManager metaspace OOM for session cluster
 Key: FLINK-27817
 URL: https://issues.apache.org/jira/browse/FLINK-27817
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Task
Reporter: godfrey he


From user ML:
https://www.mail-archive.com/user-zh@flink.apache.org/msg15224.html

For SQL jobs, most operators are code-generated with *unique class names*;
this causes the TaskManager metaspace to keep growing until OOM in a session
cluster.




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27815) Improve the join reorder strategy for batch sql job

2022-05-27 Thread godfrey he (Jira)
godfrey he created FLINK-27815:
--

 Summary: Improve the join reorder strategy for batch sql job 
 Key: FLINK-27815
 URL: https://issues.apache.org/jira/browse/FLINK-27815
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Planner
Reporter: godfrey he


Join is a heavy operation at execution time, and the join order in a query can
have a significant impact on the query’s performance.
Currently, the planner has one join-reorder strategy, which is provided by
Apache Calcite,
but it strongly depends on statistics. It would be better to provide different
join-reorder strategies for different situations, such as:
1. provide a join-reorder strategy without statistics, e.g. eliminate cross
joins
2. improve the current join-reorder strategy with statistics
3. provide hints to allow users to choose the join-order strategy
4. ...



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)

2022-05-23 Thread godfrey he
Hi Jark,

> "Table#createTableAs(tablePath)" seems a
>little strange to me.

`Table#createTableAs` is a bit misleading; I lean toward `Table#saveAs(tablePath)`.

Best,
Godfrey

Jark Wu  于2022年5月18日周三 23:09写道:
>
> Hi Godfrey,
>
> Regarding Table API for CTAS, "Table#createTableAs(tablePath)" seems a
> little strange to me.
> Usually, the parameter after AS should be the query, but the query is in
> front of AS.
> I slightly prefer a method on TableEnvironment besides "createTable" (i.e.
> a special createTable with writing data).
>
> For example:
> void createTableAs(String path, TableDescriptor descriptor, Table query);
>
> Usage:
> tableEnv.createTableAs(
> "T1",
> TableDescriptor.forConnector("hive")
> .option("format", "parquet")
> .build(),
> query);
>
>
> Best,
> Jark
>
> On Wed, 18 May 2022 at 22:53, Jark Wu  wrote:
>
> > Hi Mang,
> >
> > Thanks for proposing this, CTAS is a very important API for batch users.
> >
> > I think the key problem of this FLIP is the ACID semantics of the CTAS
> > operation.
> > We care most about two parts of the semantics:
> > 1) Atomicity: the created table should be rolled back if the write is
> > failed.
> > 2) Isolation: the created table shouldn't be visible before the write is
> > successful (read uncommitted).
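> >
> > For reference, the kind of statement under discussion, as a minimal CTAS
> > sketch with hypothetical table names:
> >
> >   CREATE TABLE t1 AS SELECT * FROM src;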
> >
> > From your investigation, it seems that:
> > - Flink (your FLIP): none of them.   ==> LEVEL-1
> > - Spark DataSource v1: is atomic (can roll back), but is not isolated. ==>
> > LEVEL-2
> > - Spark DataSource v2: guarantees both of them.  ==> LEVEL-3
> > - Hive MR: guarantees both of them. ==> LEVEL-3
> >
> > In order to support higher ACID semantics, I agree with Godfrey that we
> > need some hooks in JM
> > which can be called when the job is finished or failed/canceled. It might
> > look like
> > `StreamExecutionEnvironment#registerJobListener(JobListener)`,
> > but JobListener is called on the
> > client side. What we need is an interface called on the JM side, because
> > the job can be submitted in
> > detached mode.
> >
> > With this interface, we can easily support LEVEL-2 semantics by calling
> > `Catalog#dropTable` in the
> > `JobListener#onJobFailed`. We can also support LEVEL-3 by introducing
> > `StagingTableCatalog` like Spark,
> > calling `StagedTable#commitStagedChanges()` in `JobListener#onJobFinished`
> > and
> > calling StagedTable#abortStagedChanges() in `JobListener#onJobFailed`.
> >
> > Best,
> > Jark
> >
> >
> > On Wed, 18 May 2022 at 12:29, godfrey he  wrote:
> >
> >> Hi Mang,
> >>
> >> Thanks for driving this FLIP.
> >>
> >> Please follow the FLIP template[1] style, and the `Syntax ` is part of
> >> the `Public API Changes` section.
> >> ‘Program research’ and 'Implementation Plan' are part of the `Proposed
> >> Changes` section,
> >> or move ‘Program research’ to the appendix.
> >>
> >> > Providing methods that are used to execute CTAS for Table API users.
> >> We should introduce `createTable` in `Table` instead of
> >> `TableEnvironment`.
> >> Because all table operations are defined in `Table`, see:
> >> Table#executeInsert,
> >> Table#insertInto, etc.
> >> About the method name, I prefer to use `createTableAs`.
> >>
> >> > TableSink needs to provide the CleanUp API, developers implement as
> >> needed.
> >> I think it's hard for TableSink to implement a clean up operation. For
> >> file system sink,
> >> the data can be written to a temporary directory, but for key/value
> >> sinks, it's hard to
> >> remove the written keys, unless the sink records all written keys.
> >>
> >> > Do not do drop table operations in the framework, drop table is
> >> implemented in
> >> TableSink according to the needs of specific TableSink
> >> The TM process may crash at any time, and the drop operation will not
> >> be executed any more.
> >>
> >> How about we do the drop table operation and cleanup data action in the
> >> catalog?
> >> Where to execute the drop operation: one approach is in the client, the
> >> other is in the JM.
> >> 1. in client: this requires the client to stay alive until the job is
> >> finished or fails.
> >> 2. in JM: this requires the JM co

Re: [DISCUSS] FLIP-222: Support full query lifecycle statements in SQL client

2022-05-23 Thread godfrey he
tJobs, the
> >> same with Flink CLI. I think it’s okay to have non-SQL jobs listed in SQL
> >> client, because
> >> these jobs can be managed via SQL client too.
> >>
> >> WRT finished time, I think you’re right. Adding it to the FLIP. But I’m a
> >> bit afraid that the
> >> rows would be too long.
> >>
> >> WRT ‘DROP QUERY’,
> >>> What's the behavior for batch jobs and the non-running jobs?
> >>
> >>
> >> In general, the behavior would be aligned with Flink CLI. Triggering a
> >> savepoint for
> >> a non-running job would cause errors, and the error message would be
> >> printed to
> >> the SQL client. Triggering a savepoint for batch(unbounded) jobs in
> >> streaming
> >> execution mode would be the same with streaming jobs. However, for batch
> >> jobs in
> >> batch execution mode, I think there would be an error, because batch
> >> execution
> >> doesn’t support checkpoints currently (please correct me if I’m wrong).
> >>
> >> WRT ’SHOW SAVEPOINTS’, I’ve thought about it, but Flink clusterClient/
> >> jobClient doesn’t have such a functionality at the moment, neither do
> >> Flink CLI.
> >> Maybe we could make it a follow-up FLIP, which includes the modifications
> >> to
> >> clusterClient/jobClient and Flink CLI. WDYT?
> >>
> >> Best,
> >> Paul Lam
> >>
> >>> 2022年5月17日 20:34,godfrey he  写道:
> >>>
> >>> Godfrey
> >>
> >>
>


Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-05-23 Thread godfrey he
Hi, Jark

Thanks for the feedback.

> 1) All the ability interfaces begin with "Supports" instead of "Support".
+1

> The "connect" word should be "collect"?
Yes, it's a typo.

> CatalogStatistics
Yes, we should use TableStats.
I forgot that TableStats and ColumnStats have been ported to the API module.

> What's the difference between them?
table.optimizer.source.collect-statistics-enabled is used for all connectors,
while source.statistics-type is only for file-based connectors.
It may take a long time to get the detailed statistics,
but maybe the file size (which will be introduced later) is enough.

> IMO, we should also support Hive source as well in this FLIP.
+1

Best,
Godfrey
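To show where the two knobs above would live, a minimal sketch using the
draft option names from this thread (both may be renamed before the FLIP is
finalized, and the statistics-type values are likewise illustrative):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class StatisticsOptionsSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inBatchMode());

            // Global planner switch, applies to all connectors (draft name).
            tEnv.getConfig().getConfiguration()
                    .setString("table.optimizer.source.collect-statistics-enabled", "true");

            // Per-table option for file-based connectors: how detailed (and
            // how expensive) the reported statistics may be (draft name/values).
            tEnv.executeSql(
                    "CREATE TABLE orders (id BIGINT, amount DOUBLE) WITH (\n"
                            + "  'connector' = 'filesystem',\n"
                            + "  'path' = '/tmp/orders',\n"
                            + "  'format' = 'parquet',\n"
                            + "  'source.statistics-type' = 'DETAILED'\n"
                            + ")");
        }
    }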

On Fri, May 20, 2022 at 12:04, Jark Wu  wrote:
>
> Hi Godfrey,
>
> I just left some comments here:
>
> 1) SupportStatisticReport => SupportsStatisticReport
> All the ability interfaces begin with "Supports" instead of "Support".
>
> 2) table.optimizer.source.connect-statistics-enabled
> The "connect" word should be "collect"?
>
> 3) CatalogStatistics
> I was a little confused when I first saw the name. I thought it reports
> stats for a catalog...
> Why not use "TableStats" which already wraps "ColumnStats" in it and is a
> public API as well?
>
> 4) source.statistics-type
> vs table.optimizer.source.collect-statistics-enabled
> What's the difference between them? It seems that they are both used to
> enable or disable reporting stats.
>
> 5) "Which connectors and formats will be supported by default?"
> IMO, we should also support Hive source as well in this FLIP.
> Hive source is more widely used than Filesystem connector.
>
> Best,
> Jark
>
>
>
>
> On Tue, 17 May 2022 at 10:52, Jingsong Li  wrote:
>
> > Hi Godfrey,
> >
> > Thanks for your reply.
> >
> > Sounds good to me.
> >
> > > I think we should also introduce a config option
> >
> > We can add this option to the FLIP. I prefer an option for
> > FileSystemConnector, maybe an enum.
> >
> > Best,
> > Jingsong
> >
> > On Tue, May 17, 2022 at 10:31 AM godfrey he  wrote:
> >
> > > Hi Jingsong,
> > >
> > > Thanks for the feedback.
> > >
> > >
> > > >One concern I have is that we read the footer for each file, and this
> > may
> > > >be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way
> > > Yes, if there are thousands of orc/parquet files, it may take a long
> > time.
> > > So we can introduce a config option to let the user choose the
> > > granularity of the statistics.
> > > But the SIZE will not be introduced, because the planner does not use
> > > the file size statistics now.
> > > We can introduce it once file size statistics are introduced in the future.
> > > I think we should also introduce a config option to enable/disable
> > > SupportStatisticReport,
> > > because it's a heavy operation for some connectors in some cases.
> > >
> > > > is the filter pushdown already happening at
> > > > this time?
> > > That's a good point. Currently, the filter push down is after partition
> > > pruning
> > > to prevent the filter push down rule from consuming the partition
> > > predicates.
> > > The statistics will be set to unknown if a filter is pushed down now.
> > > To combine them all, we can create an optimization program after filter
> > > push
> > > down program to collect the statistics. This could avoid collecting
> > > statistics multiple times.
> > >
> > >
> > > Best,
> > > Godfrey
> > >
> > > On Fri, May 13, 2022 at 22:44, Jingsong Li  wrote:
> > > >
> > > > Thank Godfrey for driving.
> > > >
> > > > Looks very good~ This will undoubtedly greatly enhance the various
> > batch
> > > > mode connectors.
> > > >
> > > > I left some comments:
> > > >
> > > > ## FileBasedStatisticsReportableDecodingFormat
> > > >
> > > > One concern I have is that we read the footer for each file, and this
> > may
> > > > be a bit costly in some cases. Is it possible for us to have some
> > > > hierarchical way, e.g.
> > > > - No statistics are collected for files by default.
> > > > - SIZE: Generate statistics based on file Size, get the size of the
> > file
> > > > only with access to the master of the FileSystem.

Re: [VOTE] FLIP-91: Support SQL Gateway

2022-05-22 Thread godfrey he
+1

Best,
Godfrey

On Mon, May 23, 2022 at 13:06, LuNing Wang  wrote:
>
> +1 (non-binding)
>
> Best,
> LuNing Wang
>
> On Mon, May 23, 2022 at 12:57, Nicholas Jiang  wrote:
>
> > +1 (non-binding)
> >
> > Best,
> > Nicholas Jiang
> >
> > On 2022/05/20 02:38:39 Shengkai Fang wrote:
> > > Hi, everyone.
> > >
> > > Thanks for your feedback for FLIP-91: Support SQL Gateway[1] on the
> > > discussion thread[2]. I'd like to start a vote for it. The vote will be
> > > open for at least 72 hours unless there is an objection or not enough
> > votes.
> > >
> > > Best,
> > > Shengkai
> > >
> > > [1]
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-91%3A+Support+SQL+Gateway
> > [2] https://lists.apache.org/thread/gr7soo29z884r1scnz77r2hwr2xmd9b0
> > >
> >


Re: [VOTE] FLIP-229: Introduces Join Hint for Flink SQL Batch Job

2022-05-17 Thread godfrey he
Thanks Xuyang for driving this, +1 (binding)

Best,
Godfrey

On Tue, May 17, 2022 at 10:21, Xuyang  wrote:
>
> Hi, everyone.
> Thanks for your feedback for FLIP-229: Introduces Join Hint for Flink SQL 
> Batch Job[1] on the discussion thread[2].
> I'd like to start a vote for it. The vote will be open for at least 72 hours 
> unless there is an objection or not enough votes.
>
> --
>
> Best!
> Xuyang
>
>
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-229%3A+Introduces+Join+Hint+for+Flink+SQL+Batch+Job
> [2] https://lists.apache.org/thread/y668bxyjz66ggtjypfz9t571m0tyvv9h
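For context, a minimal sketch of what the proposed hints look like in use.
The BROADCAST hint name follows the FLIP draft (final names may differ), and
the datagen tables are only there to make the snippet self-contained:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class JoinHintSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inBatchMode());

            tEnv.executeSql(
                    "CREATE TABLE orders (id BIGINT, customer_id BIGINT)\n"
                            + "WITH ('connector' = 'datagen', 'number-of-rows' = '1000')");
            tEnv.executeSql(
                    "CREATE TABLE customers (id BIGINT, name STRING)\n"
                            + "WITH ('connector' = 'datagen', 'number-of-rows' = '10')");

            // The hint asks the planner to broadcast the small customers table,
            // overriding the cost-based choice (useful when statistics are poor).
            tEnv.executeSql(
                    "SELECT /*+ BROADCAST(customers) */ o.id, c.name\n"
                            + "FROM orders o JOIN customers c ON o.customer_id = c.id")
                .print();
        }
    }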


Re: Re: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)

2022-05-17 Thread godfrey he
Hi Mang,

Thanks for driving this FLIP.

Please follow the FLIP template[1] style: the `Syntax` part belongs in
the `Public API Changes` section, and
'Program research' and 'Implementation Plan' are part of the `Proposed
Changes` section
(or move 'Program research' to the appendix).

> Providing methods that are used to execute CTAS for Table API users.
We should introduce `createTable` in `Table` instead of `TableEnvironment`.
Because all table operations are defined in `Table`, see: Table#executeInsert,
Table#insertInto, etc.
About the method name, I prefer to use `createTableAs`.

> TableSink needs to provide the CleanUp API, developers implement as needed.
I think it's hard for TableSink to implement a clean up operation. For
file system sink,
the data can be written to a temporary directory, but for key/value
sinks, it's hard to
remove the written keys, unless the sink records all written keys.

> Do not do drop table operations in the framework, drop table is implemented in
TableSink according to the needs of specific TableSink
The TM process may crash at any time, and the drop operation will not
be executed any more.

How about we do the drop table operation and cleanup data action in the catalog?
Where to execute the drop operation? One approach is in the client, the other is in the JM.
1. In the client: this requires the client to stay alive until the job is
finished or failed.
2. In the JM: this requires the JM to provide some interfaces/hooks so
that the planner
implements the logic and the code is executed in the JM.
I prefer approach two, but it requires more detailed design with the
runtime side. @gaoyunhaii, @kevin.yingjie


[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template

Best,
Godfrey
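To visualize the Table API shape suggested above, a quick sketch.
`createTableAs` is only the method name preferred in this thread and does
not exist yet, so it appears in a comment next to what users must write
today:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class CtasTableApiSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            tEnv.executeSql(
                    "CREATE TABLE source_table (id BIGINT, name STRING)\n"
                            + "WITH ('connector' = 'datagen', 'number-of-rows' = '10')");

            Table query = tEnv.sqlQuery("SELECT id, UPPER(name) AS name FROM source_table");

            // Hypothetical shape from this thread, alongside Table#executeInsert
            // and Table#insertInto:
            //   query.createTableAs("target_table").execute();

            // What users must write today: explicit DDL plus an INSERT.
            tEnv.executeSql(
                    "CREATE TABLE target_table (id BIGINT, name STRING)\n"
                            + "WITH ('connector' = 'blackhole')");
            query.executeInsert("target_table");
        }
    }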


On Fri, May 6, 2022 at 11:24, Mang Zhang  wrote:

>
> Hi, Yuxia
> Thanks for your reply!
> About question 1, we will not support it; FLIP-218[1] is meant to simplify the
> complexity of user DDL and make it easier for users to use. I have never
> encountered this case in big data scenarios.
> About question 2, we will provide a public API like the one below:
> public void cleanUp();
>
>   Regarding the mechanism of cleanUp, people who are familiar with the 
> runtime module need to provide professional advice, which is what we need to 
> focus on.
>
>
>
>
>
>
>
>
>
>
> --
>
> Best regards,
> Mang Zhang
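A hypothetical sketch of the cleanUp hook described above; the method name
comes from this thread, while the interface name and invocation point are
assumptions made purely for illustration:

    /**
     * Hypothetical mixin for sinks that can undo a failed CTAS write.
     * The framework would call cleanUp() when the CTAS job fails, so the
     * sink can remove whatever it already wrote, e.g. delete a temporary
     * directory, or delete the written keys from a key/value store.
     */
    public interface SupportsCleanUp {
        void cleanUp();
    }

As godfrey notes elsewhere in this thread, the hard part is not the
interface but the guarantee that someone actually invokes it when the TM or
client dies mid-job.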
>
>
>
>
>
> At 2022-04-29 17:00:03, "yuxia"  wrote:
> >Thanks for driving this work, it's going to be a useful feature.
> >About the flip-218, I have some questions.
> >
> >1: Does our CTAS syntax support specifying the target table's schema, including
> >column names and data types? I think it may be a useful feature in case we want
> >to change the data types in the target table instead of always copying the source
> >table's schema. It'll be more flexible with this feature.
> >Btw, MySQL's "CREATE TABLE ... SELECT Statement"[1] support this feature.
> >
> >2: Seems it'll require the sink to implement a public interface to drop the table,
> >so what will the interface look like?
> >
> >[1] https://dev.mysql.com/doc/refman/8.0/en/create-table-select.html
> >
> >Best regards,
> >Yuxia
> >
> >----- Original Message -----
> >From: "Mang Zhang" 
> >To: "dev" 
> >Sent: Thursday, April 28, 2022 4:57:24 PM
> >Subject: [DISCUSS] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)
> >
> >Hi, everyone
> >
> >
> >I would like to open a discussion about supporting a SELECT clause in CREATE
> >TABLE (CTAS).
> >With the development of business and the enhancement of Flink SQL
> >capabilities, queries are becoming more and more complex.
> >Now the user needs to use the CREATE TABLE statement to create the target
> >table first, and then execute the INSERT statement.
> >However, the target table may have many columns, which will bring a lot of
> >work outside the business logic to the user.
> >At the same time, the user must ensure that the schema of the created target
> >table is consistent with the schema of the query result.
> >Using a CTAS syntax like Hive/Spark can greatly facilitate the user.
> >
> >
> >
> >You can find more details in FLIP-218[1]. Looking forward to your feedback.
> >
> >
> >
> >[1] 
> >https://cwiki.apache.org/confluence/display/FLINK/FLIP-218%3A+Support+SELECT+clause+in+CREATE+TABLE(CTAS)
> >
> >
> >
> >
> >--
> >
> >Best regards,
> >Mang Zhang
>
>
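For readers following along, a minimal sketch of the syntax under
discussion, assuming the CTAS form lands roughly as in Hive/Spark (the WITH
clause placement and the use of executeSql are assumptions of this sketch):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CtasSyntaxSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inBatchMode());

            tEnv.executeSql(
                    "CREATE TABLE wide_orders (id BIGINT, amount DOUBLE)\n"
                            + "WITH ('connector' = 'datagen', 'number-of-rows' = '100')");

            // CTAS: the target schema is derived from the query result, so the
            // user no longer repeats every column in DDL.
            tEnv.executeSql(
                    "CREATE TABLE filtered_orders WITH ('connector' = 'blackhole')\n"
                            + "AS SELECT * FROM wide_orders WHERE amount > 0");
        }
    }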


Re: [DISCUSS] FLIP-222: Support full query lifecycle statements in SQL client

2022-05-17 Thread godfrey he
Hi Paul,

Thanks for driving this, LGTM overall.

I have a few minor comments:

>SHOW QUERIES
I want to clarify the scope of the command: does the command show the
queries submitted
via the SQL client, or all queries in the current cluster (submitted via other CLIs)?
Are history queries included? What's the behavior for a per-job cluster?

The result should contain a 'finish_time' field, which is more friendly
for batch jobs.

>DROP QUERY '<query_id>'
What's the behavior for batch jobs and the non-running jobs?

>SAVEPOINT '<query_id>'
+1 to align with the SQL standard.
What's the behavior for batch jobs?

SHOW SAVEPOINTS is missing.

* Table API
+1 to introduce the API in Table API

Best,
Godfrey

On Wed, May 11, 2022 at 19:20, Paul Lam  wrote:
>
> Hi Jark,
>
> Thanks a lot for your opinions and suggestions! Please see my replies inline.
>
> > 1) the display of savepoint_path
>
>
> Agreed. Adding it to the FLIP.
>
> > 2) Please make a decision on multiple options in the FLIP.
>
> Okay. I’ll keep one and move the other to the rejected alternatives section.
>
> > 4) +1 SHOW QUERIES
> > Btw, the displayed column "address" is a little confusing to me.
> > At the first glance, I'm not sure what address it is, JM RPC address? JM 
> > REST address? Gateway address?
> > If this is a link to the job's web UI URL, how about calling it "web_url" 
> > and displaying it in
> > "http://<host>:<port>" format?
> > Besides, how about displaying "startTime" or "uptime" as well?
>
> I’m good with these changes. Updating the FLIP according to your suggestions.
>
> > 5) STOP/CANCEL QUERY vs DROP QUERY
> > I'm +1 to DROP, because it's more compliant with SQL standard naming, i.e., 
> > "SHOW/CREATE/DROP".
> > Separating STOP and CANCEL confuses users a lot about the differences
> > between them.
> > I'm +1 to add the "PURGE" keyword to the DROP QUERY statement, which 
> > indicates to stop query without savepoint.
> > Note that, PURGE doesn't mean stop with --drain flag. The drain flag will 
> > flush all the registered timers
> > and windows which could lead to incorrect results when the job is resumed. 
> > I think the drain flag is rarely used
> > (please correct me if I'm wrong), therefore, I suggest moving this feature 
> > into future work when the needs are clear.
>
> I’m +1 to represent ungraceful cancel by PURGE. I think the --drain flag is not
> used very often as you said, and we
> could just add a table config option to enable that flag.
>
> > 7) <query_id> and <savepoint_path> should be quoted
> > All the <query_id> and <savepoint_path> should be string literals, otherwise
> > it's hard to parse them.
> > For example, STOP QUERY '<query_id>'.
>
> Good point! Adding it to the FLIP.
>
> > 8) Examples
> > Could you add an example that consists of all the statements to show how to 
> > manage the full lifecycle of queries?
> > Including show queries, create savepoint, remove savepoint, stop query with 
> > a savepoint, and restart query with savepoint.
>
> Agreed. Adding it to the FLIP as well.
>
> Best,
> Paul Lam
>
> > On May 7, 2022, at 18:22, Jark Wu  wrote:
> >
> > Hi Paul,
> >
> > I think this FLIP has already in a good shape. I just left some additional 
> > thoughts:
> >
> > 1) the display of savepoint_path
> > Could the displayed savepoint_path include the scheme part?
> > E.g. `hdfs:///flink-savepoints/savepoint-cca7bc-bb1e257f0dab`
> > IIUC, the scheme part is omitted when it's a local filesystem.
> > But the behavior would be clearer if including the scheme part in the 
> > design doc.
> >
> > 2) Please make a decision on multiple options in the FLIP.
> > It might give the impression that we will support all the options.
> >
> > 3) +1 SAVEPOINT and RELEASE SAVEPOINT
> > Personally, I also prefer "SAVEPOINT <savepoint_id>" and "RELEASE SAVEPOINT
> > <savepoint_id>"
> > to "CREATE/DROP SAVEPOINT", as they have been used in mature databases.
> >
> > 4) +1 SHOW QUERIES
> > Btw, the displayed column "address" is a little confusing to me.
> > At the first glance, I'm not sure what address it is, JM RPC address? JM 
> > REST address? Gateway address?
> > If this is a link to the job's web UI URL, how about calling it "web_url" 
> > and displaying it in
> > "http://<host>:<port>" format?
> > Besides, how about displaying "startTime" or "uptime" as well?
> >
> > 5) STOP/CANCEL QUERY vs DROP QUERY
> > I'm +1 to DROP, because it's more compliant with SQL standard naming, i.e., 
> > "SHOW/CREATE/DROP".
> > Separating STOP and CANCEL confuses users a lot about the differences
> > between them.
> > I'm +1 to add the "PURGE" keyword to the DROP QUERY statement, which 
> > indicates to stop query without savepoint.
> > Note that, PURGE doesn't mean stop with --drain flag. The drain flag will 
> > flush all the registered timers
> > and windows which could lead to incorrect results when the job is resumed. 
> > I think the drain flag is rarely used
> > (please correct me if I'm wrong), therefore, I suggest moving this feature 
> > into future work when the needs are clear.
> >
> > 6) Table API
> > I think it makes sense to support the new statements in Table API.
> > We should try to make the Gateway and CLI

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-05-16 Thread godfrey he
Hi Jingsong,

Thanks for the feedback.


>One concern I have is that we read the footer for each file, and this may
>be a bit costly in some cases. Is it possible for us to have some
> hierarchical way
Yes, if there are thousands of orc/parquet files, it may take a long time.
So we can introduce a config option to let the user choose the
granularity of the statistics.
But the SIZE will not be introduced, because the planner does not use
the file size statistics now.
We can introduce it once file size statistics are introduced in the future.
I think we should also introduce a config option to enable/disable
SupportStatisticReport,
because it's a heavy operation for some connectors in some cases.

> is the filter pushdown already happening at
> this time?
That's a good point. Currently, the filter push down is after partition pruning
to prevent the filter push down rule from consuming the partition predicates.
The statistics will be set to unknown if a filter is pushed down now.
To combine them all, we can create an optimization program after filter push
down program to collect the statistics. This could avoid collecting
statistics multiple times.


Best,
Godfrey

On Fri, May 13, 2022 at 22:44, Jingsong Li  wrote:
>
> Thank Godfrey for driving.
>
> Looks very good~ This will undoubtedly greatly enhance the various batch
> mode connectors.
>
> I left some comments:
>
> ## FileBasedStatisticsReportableDecodingFormat
>
> One concern I have is that we read the footer for each file, and this may
> be a bit costly in some cases. Is it possible for us to have some
> hierarchical way, e.g.
> - No statistics are collected for files by default.
> - SIZE: Generate statistics based on file Size, get the size of the file
> only with access to the master of the FileSystem.
> - DETAILED: Get the complete statistics by format, possibly by accessing
> the footer of the file.
>
> ## When use the statistics reported by connector
>
> > When partitions are pruned by PushPartitionIntoTableSourceScanRule, the
> statistics should also be updated.
>
> I understand that we definitely need to use reporter after the partition
> prune, but another question: is the filter pushdown already happening at
> this time?
> Can we make sure that in the following three cases, both the filter
> pushdown and the partition prune happen before the stats reporting?
> - only partition prune happens
> - only filter pushdown happens
> - both filter pushdown and partition prune happen
>
> Best,
> Jingsong
>
> On Fri, May 13, 2022 at 6:57 PM godfrey he  wrote:
>
> > Hi all,
> >
> > I would like to open a discussion on FLIP-231:  Introduce
> > SupportStatisticReport
> > to support reporting statistics from source connectors.
> >
> > Statistics are one of the most important inputs to the optimizer.
> > Accurate and complete statistics allow the optimizer to be more powerful.
> > Currently, the statistics of Flink SQL come from Catalog only,
> > while many Connectors have the ability to provide statistics, e.g.
> > FileSystem.
> > In production, we find many tables in Catalog do not have any statistics.
> > As a result, the optimizer can't generate better execution plans,
> > especially for Batch jobs.
> >
> > There are two approaches to enhance statistics for the planner,
> > one is to introduce the "ANALYZE TABLE" syntax which will write
> > the analyzed result to the catalog, another is to introduce a new
> > connector interface
> > which allows the connector itself to report statistics directly to the
> > planner.
> > The second one is a supplement to the catalog statistics.
> >
> > Here, we will discuss the second approach. Compared to the first one,
> > the second one is to get statistics in real time, no need to run an
> > analysis job for each table. This could help improve the user
> > experience.
> > (We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)
> >
> > You can find more details in FLIP-231 document[1]. Looking forward to
> > your feedback.
> >
> > [1]
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&;
> > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> >
> >
> > Best,
> > Godfrey
> >


[DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-05-13 Thread godfrey he
Hi all,

I would like to open a discussion on FLIP-231:  Introduce SupportStatisticReport
to support reporting statistics from source connectors.

Statistics are one of the most important inputs to the optimizer.
Accurate and complete statistics allow the optimizer to be more powerful.
Currently, the statistics of Flink SQL come from Catalog only,
while many Connectors have the ability to provide statistics, e.g. FileSystem.
In production, we find many tables in Catalog do not have any statistics.
As a result, the optimizer can't generate better execution plans,
especially for Batch jobs.

There are two approaches to enhance statistics for the planner,
one is to introduce the "ANALYZE TABLE" syntax which will write
the analyzed result to the catalog, another is to introduce a new
connector interface
which allows the connector itself to report statistics directly to the planner.
The second one is a supplement to the catalog statistics.

Here, we will discuss the second approach. Compared to the first one,
the second one gets statistics in real time, with no need to run an
analysis job for each table. This could help improve the user
experience.
(We will also introduce the "ANALYZE TABLE" syntax in other FLIP.)

You can find more details in FLIP-231 document[1]. Looking forward to
your feedback.

[1] 
https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&;
[2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231


Best,
Godfrey
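To make the proposal concrete, a rough sketch of a scan source that reports
its row count. The ability interface is left as a comment because it does
not exist yet, and its final name/shape may differ (review comments in this
thread already suggest spelling it SupportsStatisticReport):

    import org.apache.flink.table.connector.ChangelogMode;
    import org.apache.flink.table.connector.source.DynamicTableSource;
    import org.apache.flink.table.connector.source.ScanTableSource;
    import org.apache.flink.table.plan.stats.TableStats;

    public class MyFileSource implements ScanTableSource /*, SupportsStatisticReport */ {

        // The single method the proposed ability interface would require.
        public TableStats reportStatistics() {
            long rowCount = rowCountFromFileFooters();
            return rowCount < 0 ? TableStats.UNKNOWN : new TableStats(rowCount);
        }

        private long rowCountFromFileFooters() {
            // e.g. sum the row counts recorded in parquet/orc footers;
            // return -1 when they cannot be read (statistics unknown).
            return -1L;
        }

        @Override
        public ChangelogMode getChangelogMode() {
            return ChangelogMode.insertOnly();
        }

        @Override
        public ScanRuntimeProvider getScanRuntimeProvider(ScanContext ctx) {
            throw new UnsupportedOperationException("omitted in this sketch");
        }

        @Override
        public DynamicTableSource copy() {
            return new MyFileSource();
        }

        @Override
        public String asSummaryString() {
            return "MyFileSource";
        }
    }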


[jira] [Created] (FLINK-27591) Improve the plan for batch queries when statistics is unavailable

2022-05-12 Thread godfrey he (Jira)
godfrey he created FLINK-27591:
--

 Summary: Improve the plan for batch queries when statistics is 
unavailable 
 Key: FLINK-27591
 URL: https://issues.apache.org/jira/browse/FLINK-27591
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Planner
Reporter: godfrey he


This jira is an umbrella issue, which aims to improve the plans for batch
queries when statistics are unavailable.
Currently, when statistics are unavailable, the planner will assign default costs,
which may lead to the planner choosing a bad plan; for example, a wrong broadcast
join plan will cause a lot of network shuffle and OOM.

We can detect whether the source tables have statistics. If not, join reordering
and hash join can be disabled.




--
This message was sent by Atlassian Jira
(v8.20.7#820007)
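For illustration, the mitigation described above can already be applied by
hand with released options; the ticket proposes having the planner make a
similar decision automatically when statistics are missing:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class NoStatsPlanningSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inBatchMode());

            // Without statistics, cost-based join reordering is guesswork.
            tEnv.getConfig().getConfiguration()
                    .setString("table.optimizer.join-reorder-enabled", "false");

            // Avoid hash joins whose build side might not fit in memory;
            // the planner then falls back to sort-merge joins.
            tEnv.getConfig().getConfiguration()
                    .setString("table.exec.disabled-operators", "HashJoin");
        }
    }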


[jira] [Created] (FLINK-27583) Improve the plan for TPC-DS queries

2022-05-12 Thread godfrey he (Jira)
godfrey he created FLINK-27583:
--

 Summary: Improve the plan for TPC-DS queries
 Key: FLINK-27583
 URL: https://issues.apache.org/jira/browse/FLINK-27583
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / Planner
Reporter: godfrey he


This jira is an umbrella issue, which aims to fix and improve the plans for
TPC-DS queries,
including: fixing join-order bad cases, improving multiple-input operators,
fixing some cost-model bad cases, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

