Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

godfrey he Tue, 31 May 2022 06:42:27 -0700

Hi Jark and Jing,

+1 to use "report" instead of "collect".


>  // only filter push down (*the description means
> partitionPushDownSpec == null but misses the case of
> partitionPushDownSpec != null*)

`if (partitionPushDownSpec != null && filterPushDownSpec == null)`
this branch is only consider that the partition is partition is pushed down,
but no filter is push down. The planner will collect the statistics
from catalog first,
and then try to collect the statistics from collectors if the catalog
statistics is unknown.

`else if (filterPushDownSpec != null)` this branch means  whether
the partitionPushDownSpec is null or not, the planner will collect
statistics from
collectors only, because the catalog do not support get statistics with filters.

`else if (collectStatEnabled
                && (table.getStatistic().getTableStats() == TableStats.UNKNOWN)
                && tableSource instanceof SupportStatisticReport)`
this branch means no partition and no filter are pushed down.

or we can change the pseudocode to:
 if (filterPushDownSpec != null) {
   // filter push down, no mater  partition push down or not
} else {
    if (partitionPushDownSpec != null) {
        // partition push down, while filter not
     } else {
         // no partition and filter push down
     }
}

Best,
Godfrey

Jing Ge <[email protected]> 于2022年5月29日周日 08:17写道：
>
> Hi Godfrey,
>
> Thanks for driving this FLIP.  It looks really good! Looking forward to it!
>
> If I am not mistaken, partition pruning could also happen in the following
> pseudocode condition block:
>
> else if (filterPushDownSpec != null) {
>             // only filter push down (*the description means
> partitionPushDownSpec == null but misses the case of
> partitionPushDownSpec != null*)
>
>             // the catalog do not support get statistics with filters,
>             // so only call reportStatistics method if needed
>             if (collectStatEnabled && tableSource instanceof
> SupportStatisticReport) {
>                 newTableStat = ((SupportStatisticReport)
> tableSource).reportStatistics();
>             }
>
>
> Best regards,
>
> Jing
>
>
> On Sat, May 28, 2022 at 5:09 PM Jark Wu <[email protected]> wrote:
>
> > Hi Godfrey,
> >
> > It seems that the "SupportStatisticReport" interface name and
> >  "table.optimizer.source.connect-statistics-enabled" option name is not
> > updated in the FLIP.
> >
> > Besides, in the terms of the option name, the meaning of
> > "source.statistics-type"
> > is not very straightforward and clean to me. Maybe
> > "source.report-statistics" = "none/all/file-size"
> > would be better.
> >
> > We can also change "table.optimizer.source.connect-statistics-enabled" to
> > "table.optimizer.source.report-statistics-enabled" for alignment and
> >  it's clear that one for fine-grained and one for coarse-grained.
> >
> >
> > Best,
> > Jark
> >
> > On Fri, 27 May 2022 at 22:58, godfrey he <[email protected]> wrote:
> >
> > > Hi, everyone.
> > >
> > > Thanks for all the inputs.
> > > If there is no more feedback, I think we can start the vote next monday.
> > >
> > > Best,
> > > Godfrey
> > >
> > > Martijn Visser <[email protected]> 于2022年5月25日周三 19:46写道：
> > > >
> > > > Hi Godfrey,
> > > >
> > > > Thanks for creating the FLIP. I have no comments.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > >
> > > > On Tue, 17 May 2022 at 04:52, Jingsong Li <[email protected]>
> > > wrote:
> > > >
> > > > > Hi Godfrey,
> > > > >
> > > > > Thanks for your reply.
> > > > >
> > > > > Sounds good to me.
> > > > >
> > > > > > I think we should also introduce a config option
> > > > >
> > > > > We can add this option to the FLIP. I prefer a option for
> > > > > FileSystemConnector, maybe a enum.
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Tue, May 17, 2022 at 10:31 AM godfrey he <[email protected]>
> > > wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Thanks for the feedback.
> > > > > >
> > > > > >
> > > > > > >One concern I have is that we read the footer for each file, and
> > > this
> > > > > may
> > > > > > >be a bit costly in some cases. Is it possible for us to have some
> > > > > > > hierarchical way
> > > > > > yes, if there are thousands of orc/parquet files, it may take a
> > long
> > > > > time.
> > > > > > So we can introduce a config option to let the user choose the
> > > > > > granularity of the statistics.
> > > > > > But the SIZE will not be introduced, because the planner does not
> > use
> > > > > > the file size statistics now.
> > > > > > We can introduce once file size statistics is introduce in the
> > > future.
> > > > > > I think we should also introduce a config option to enable/disable
> > > > > > SupportStatisticReport,
> > > > > > because it's a heavy operation for some connectors in some cases.
> > > > > >
> > > > > > > is the filter pushdown already happening at
> > > > > > > this time?
> > > > > > That's a good point. Currently, the filter push down is after
> > > partition
> > > > > > pruning
> > > > > > to prevent the filter push down rule from consuming the partition
> > > > > > predicates.
> > > > > > The statistics will be set to unknown if filter is pushed down now.
> > > > > > To combine them all, we can create an optimization program after
> > > filter
> > > > > > push
> > > > > > down program to collect the statistics. This could avoid collecting
> > > > > > statistics multiple times.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Godfrey
> > > > > >
> > > > > > Jingsong Li <[email protected]> 于2022年5月13日周五 22:44写道：
> > > > > > >
> > > > > > > Thank Godfrey for driving.
> > > > > > >
> > > > > > > Looks very good~ This will undoubtedly greatly enhance the
> > various
> > > > > batch
> > > > > > > mode connectors.
> > > > > > >
> > > > > > > I left some comments:
> > > > > > >
> > > > > > > ## FileBasedStatisticsReportableDecodingFormat
> > > > > > >
> > > > > > > One concern I have is that we read the footer for each file, and
> > > this
> > > > > may
> > > > > > > be a bit costly in some cases. Is it possible for us to have some
> > > > > > > hierarchical way, e.g.
> > > > > > > - No statistics are collected for files by default.
> > > > > > > - SIZE: Generate statistics based on file Size, get the size of
> > the
> > > > > file
> > > > > > > only with access to the master of the FileSystem.
> > > > > > > - DETAILED: Get the complete statistics by format, possibly by
> > > > > accessing
> > > > > > > the footer of the file.
> > > > > > >
> > > > > > > ## When use the statistics reported by connector
> > > > > > >
> > > > > > > > When partitions are pruned by
> > > PushPartitionIntoTableSourceScanRule,
> > > > > the
> > > > > > > statistics should also be updated.
> > > > > > >
> > > > > > > I understand that we definitely need to use reporter after the
> > > > > partition
> > > > > > > prune, but another question: is the filter pushdown already
> > > happening
> > > > > at
> > > > > > > this time?
> > > > > > > Can we make sure that in the following three cases, both the
> > filter
> > > > > > > pushdown and the partition prune happen before the stats
> > reporting.
> > > > > > > - only partition prune happens
> > > > > > > - only filter pushdown happens
> > > > > > > - both filter pushdown and partition prune happen
> > > > > > >
> > > > > > > Best,
> > > > > > > Jingsong
> > > > > > >
> > > > > > > On Fri, May 13, 2022 at 6:57 PM godfrey he <[email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I would like to open a discussion on FLIP-231:  Introduce
> > > > > > > > SupportStatisticReport
> > > > > > > > to support reporting statistics from source connectors.
> > > > > > > >
> > > > > > > > Statistics are one of the most important inputs to the
> > optimizer.
> > > > > > > > Accurate and complete statistics allows the optimizer to be
> > more
> > > > > > powerful.
> > > > > > > > Currently, the statistics of Flink SQL come from Catalog only,
> > > > > > > > while many Connectors have the ability to provide statistics,
> > > e.g.
> > > > > > > > FileSystem.
> > > > > > > > In production, we find many tables in Catalog do not have any
> > > > > > statistics.
> > > > > > > > As a result, the optimizer can't generate better execution
> > plans,
> > > > > > > > especially for Batch jobs.
> > > > > > > >
> > > > > > > > There are two approaches to enhance statistics for the planner,
> > > > > > > > one is to introduce the "ANALYZE TABLE" syntax which will write
> > > > > > > > the analyzed result to the catalog, another is to introduce a
> > new
> > > > > > > > connector interface
> > > > > > > > which allows the connector itself to report statistics directly
> > > to
> > > > > the
> > > > > > > > planner.
> > > > > > > > The second one is a supplement to the catalog statistics.
> > > > > > > >
> > > > > > > > Here, we will discuss the second approach. Compared to the
> > first
> > > one,
> > > > > > > > the second one is to get statistics in real time, no need to
> > run
> > > an
> > > > > > > > analysis job for each table. This could help improve the user
> > > > > > > > experience.
> > > > > > > > (We will also introduce the "ANALYZE TABLE" syntax in other
> > > FLIP.)
> > > > > > > >
> > > > > > > > You can find more details in FLIP-231 document[1]. Looking
> > > forward to
> > > > > > > > your feedback.
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860&draftShareId=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00&;
> > > > > > > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-231
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Godfrey
> > > > > > > >
> > > > > >
> > > > >
> > >
> >

Re: [DISCUSS] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

Reply via email to