Hi Godfrey, thanks for starting this discussion, this is a great feature
for batch users.

The FLIP looks good to me in general.

I only have 2 comments:

1) How do users know whether the given table or partition contains required
statistics?
Do we need to provide DESC EXTENDED <table_name> statement like Spark[1] to
show statistic for table/partition/columns?

2) If ANALYZE TABLE can only run in batch mode, is it possible to ignore
execution mode
and force using batch mode for the statement? From my perspective, ANALYZE
TABLE
is an auxiliary statement similar to SHOW TABLES but heavier, which doesn't
care about
environment execution mode.

Best,
Jark

[1]:
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-analyze-table.html

On Tue, 14 Jun 2022 at 13:52, Jing Ge <j...@ververica.com> wrote:

> Hi 华宗
>
> 退订请发送任意消息至dev-unsubscr...@flink.apache.org
> In order to unsubscribe, please send an email to
> dev-unsubscr...@flink.apache.org
>
> Thanks
>
> Best regards,
> Jing
>
>
> On Tue, Jun 14, 2022 at 2:05 AM 华宗 <zhanghuaz...@126.com> wrote:
>
> > 退订
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > At 2022-06-13 22:44:24, "cao zou" <zoucao...@gmail.com> wrote:
> > >Hi godfrey, thanks for your detail explanation.
> > >After explaining and glancing over the FLIP-231, I think it is
> > >really need, +1 for this and looking forward to it.
> > >
> > >best
> > >zoucao
> > >
> > >godfrey he <godfre...@gmail.com> 于2022年6月13日周一 14:43写道:
> > >
> > >> Hi Ingo,
> > >>
> > >> The semantics does not distinguish batch and streaming,
> > >> It works for both batch and streaming, but the result of
> > >> unbounded sources is meaningless.
> > >> Currently, I throw exception for streaming mode,
> > >> and we can support streaming mode with bounded source
> > >> in the future.
> > >>
> > >> Best,
> > >> Godfrey
> > >>
> > >> Ingo Bürk <airbla...@apache.org> 于2022年6月13日周一 14:17写道:
> > >> >
> > >> > Hi Godfrey,
> > >> >
> > >> > thank you for the explanation. A SELECT is definitely more generic
> and
> > >> > will work for all connectors automatically. As such I think it's a
> > good
> > >> > baseline solution regardless.
> > >> >
> > >> > We can also think about allowing connector-specific optimizations in
> > the
> > >> > future, but I do like your idea of letting the optimizer rules
> > perform a
> > >> > lot of the work here already by leveraging existing optimizations.
> > >> > Similarly things like non-null counts of non-nullable columns would
> > (or
> > >> > at least could) be handled by the optimizer rules already.
> > >> >
> > >> > So as far as that point goes, +1 to the generic approach.
> > >> >
> > >> > One more point, though: In general we should avoid supporting
> features
> > >> > only in specific modes as it breaks the unification promise. Given
> > that
> > >> > ANALYZE is a manual and completely optional operation I'm OK with
> > doing
> > >> > that here in principle. However, I wonder what will happen in the
> > >> > streaming / unbounded case. Do you plan to throw an error? Or do we
> > >> > complete the command as successful but without doing anything?
> > >> >
> > >> >
> > >> > Best
> > >> > Ingo
> > >> >
> > >> > On 13.06.22 05:50, godfrey he wrote:
> > >> > > Hi Ingo,
> > >> > >
> > >> > > Thanks for the inputs.
> > >> > >
> > >> > > I think converting `ANALYZE TABLE` to `SELECT` statement is
> > >> > > more generic approach. Because query plan optimization is more
> > generic,
> > >> > >   we can provide more optimization rules to optimize not only
> > `SELECT`
> > >> statement
> > >> > > converted from `ANALYZE TABLE` but also the `SELECT` statement
> > written
> > >> by users.
> > >> > >
> > >> > >> JDBC connector can get a row count estimate without performing a
> > >> > >> SELECT COUNT(1)
> > >> > > To optimize such cases, we can implement a rule to push aggregate
> > into
> > >> > > table source.
> > >> > > Currently, there is a similar rule: SupportsAggregatePushDown,
> which
> > >> > > supports only pushing
> > >> > > local aggregate into source now.
> > >> > >
> > >> > >
> > >> > > Best,
> > >> > > Godfrey
> > >> > >
> > >> > > Ingo Bürk <airbla...@apache.org> 于2022年6月10日周五 17:15写道:
> > >> > >>
> > >> > >> Hi Godfrey,
> > >> > >>
> > >> > >> compared to the solution proposed in the FLIP (using a SELECT
> > >> > >> statement), I wonder if you have considered adding APIs to
> > catalogs /
> > >> > >> connectors to perform this task as an alternative?
> > >> > >> I could imagine that for many connectors, statistics could be
> > >> > >> implemented in a less expensive way by leveraging the underlying
> > >> system
> > >> > >> (e.g. a JDBC connector can get a row count estimate without
> > >> performing a
> > >> > >> SELECT COUNT(1)).
> > >> > >>
> > >> > >>
> > >> > >> Best
> > >> > >> Ingo
> > >> > >>
> > >> > >>
> > >> > >> On 10.06.22 09:53, godfrey he wrote:
> > >> > >>> Hi all,
> > >> > >>>
> > >> > >>> I would like to open a discussion on FLIP-240:  Introduce
> "ANALYZE
> > >> > >>> TABLE" Syntax.
> > >> > >>>
> > >> > >>> As FLIP-231 mentioned, statistics are one of the most important
> > >> inputs
> > >> > >>> to the optimizer. Accurate and complete statistics allows the
> > >> > >>> optimizer to be more powerful. "ANALYZE TABLE" syntax is a very
> > >> common
> > >> > >>> but effective approach to gather statistics, which is already
> > >> > >>> introduced by many compute engines and databases.
> > >> > >>>
> > >> > >>> The main purpose of  discussion is to introduce "ANALYZE TABLE"
> > >> syntax
> > >> > >>> for Flink sql.
> > >> > >>>
> > >> > >>> You can find more details in FLIP-240 document[1]. Looking
> > forward to
> > >> > >>> your feedback.
> > >> > >>>
> > >> > >>> [1]
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=217386481
> > >> > >>> [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-240
> > >> > >>>
> > >> > >>>
> > >> > >>> Best,
> > >> > >>> Godfrey
> > >>
> >
>

Reply via email to