Thanks for raising the discussion, Huaxin. I also think partition-level statistics file(s) are more useful and have advantages over table-level stats. For instance:

1. It would be straightforward to support incremental stats computation for large tables, by recalculating only new or updated partitions.
2. Table-level stats can be built from partition-level stats easily, though merging might take some additional time (a rough sketch of the merge follows below).
3. A SparkScan usually involves only part of the partitions when a `partition filter` is applied, so it would be more accurate to infer the stats from the selected partitions only.
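To make point 2 concrete, here is a minimal sketch of the rollup, using hypothetical types rather than any Iceberg or Spark API: the table min/max are just the min/max over partition-level values and null counts are additive, so the merge is cheap and can be redone incrementally over only the changed partitions. NDV is the one stat that does not merge this way unless it is kept as a mergeable sketch (e.g. the theta sketches already used for the Puffin NDV blobs).

// Minimal sketch with hypothetical types (not an Iceberg API): rolling
// partition-level column stats for one column up into a table-level summary.
import java.util.List;

public class PartitionStatsRollup {

  // Hypothetical holder for one column's stats within one partition.
  record ColumnStats(long min, long max, long nullCount) {}

  static ColumnStats mergeToTableLevel(List<ColumnStats> perPartition) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long nulls = 0L;
    for (ColumnStats p : perPartition) {
      min = Math.min(min, p.min());   // table min = min of partition mins
      max = Math.max(max, p.max());   // table max = max of partition maxes
      nulls += p.nullCount();         // null counts simply add up
    }
    return new ColumnStats(min, max, nulls);
  }

  public static void main(String[] args) {
    List<ColumnStats> partitions =
        List.of(new ColumnStats(1, 50, 3), new ColumnStats(20, 90, 0));
    // Prints ColumnStats[min=1, max=90, nullCount=3]
    System.out.println(mergeToTableLevel(partitions));
  }
}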
> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>
> Thanks for starting this thread Huaxin,
>
> The existing statistics, on a per data file basis, are definitely too granular for use in planning/analysis time query optimizations.
> It's worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to be a problem for some.
>
> However, I'm not sure that rolling the stats all the way up to the table level is what we want to add next.
> Would extending the partition stats to include some of the data present for data files be possible?
> To me, it seems like doing some amount of derivation at query time is okay, as long as the time it takes to do the derivation doesn't increase significantly as the table gets larger.
>
> Partition level stats also have the advantage of being able to provide more accurate estimates for queries with filters on the partition columns.
>
> Looking forward to talking more about the project,
> Alex Jo
>
> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table-level stats.
>>
>> Given that our primary goal in collecting statistics is for performance optimization, I believe it's not a good approach to derive these statistics at query execution time. That's why I propose saving these metrics in the table-level stats file. I am thinking of reusing the existing aggregate pushdown mechanism to compute min, max and null counts. We currently have MinAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>, MaxAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>, CountStar <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>, and CountNotNull <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>. Null counts can be derived from CountStar and CountNotNull.
>>
>> Thanks,
>> Huaxin
>>
>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>> Hi,
>>>
>>> First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino.
>>>
>>> Min, max, and null counts can be derived from manifests.
>>> I am not saying that a query engine should derive them from manifests at query time, but it definitely can.
>>> If we want to pull a min, max, and null counts table-level summary into a stats Puffin file (which probably makes sense), my concern would be what the exact mechanics for that should be. NDVs need to be derived from the data files, but maybe we can do something smarter for min, max, and null counts.
>>>
>>> Best,
>>> Piotr
>>>
>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>> Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
>>>>
>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>> I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779.
>>>>>
>>>>> In Spark 3.4, I introduced Column Statistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement and report these metrics to Spark. As a result, Spark can utilize the column statistics for cost-based optimization (CBO), including join reordering.
>>>>> Here's how the process operates:
>>>>>
>>>>> Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:
>>>>>
>>>>> ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>
>>>>> This command triggers a ComputeTableStatsSparkAction to compute a data sketch for NDV (number of distinct values) and save it in the StatisticsFile. Here is the ComputeTableStatsAction PR <https://github.com/apache/iceberg/pull/10288>.
>>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level min, max, and null counts for columns like col1 and col2 using the manifest files. These table-level statistics are then saved. Currently, we do not have a designated location to store these table-level column statistics. The StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.
>>>>>
>>>>> If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, the NDV will be retrieved from the StatisticsFile, and the min/max/null counts will be retrieved from a table-level column statistics file. These statistics will be used to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then employs the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization (CBO).
>>>>>
>>>>> Please share your thoughts on this.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Huaxin
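Coming back to Piotr's point that min, max, and null counts are derivable from the manifests: below is a rough sketch of what that derivation looks like against the core Iceberg API, assuming a long column for brevity. The class and method names are made up; a real ComputeTableStats-style action would distribute the work and persist the result, and lower/upper bounds can be truncated for some types. The point is just the cost profile (planFiles() reads manifest metadata only, no data files), not that this should be the final mechanism.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

// Hypothetical helper, not part of Iceberg: folds per-data-file metadata from
// the manifests into a table-level min/max/null count for a single column.
public class ManifestDerivedColumnStats {

  static void summarize(Table table, String columnName) throws IOException {
    Types.NestedField field = table.schema().findField(columnName);
    int fieldId = field.fieldId();

    long nullCount = 0L;
    Long min = null;
    Long max = null;

    // planFiles() reads manifests only; no data files are opened here.
    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();

        Map<Integer, Long> nulls = file.nullValueCounts();
        if (nulls != null && nulls.get(fieldId) != null) {
          nullCount += nulls.get(fieldId);
        }

        ByteBuffer lower = file.lowerBounds() == null ? null : file.lowerBounds().get(fieldId);
        if (lower != null) {
          Long value = Conversions.fromByteBuffer(field.type(), lower);
          min = (min == null || value < min) ? value : min;
        }

        ByteBuffer upper = file.upperBounds() == null ? null : file.upperBounds().get(fieldId);
        if (upper != null) {
          Long value = Conversions.fromByteBuffer(field.type(), upper);
          max = (max == null || value > max) ? value : max;
        }
      }
    }

    System.out.printf("%s: min=%s, max=%s, nullCount=%d%n", columnName, min, max, nullCount);
  }
}

The same numbers can also come out of the aggregate pushdown path Huaxin mentions, where the null count is simply count(*) minus count(col).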
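And on the consumption side described above: once the table-level min/max/null counts have a home, the reporting path is the Spark 3.4 DSv2 column-stats interface the proposal links. Below is a minimal sketch of such an implementation; the class name and constructor are made up for illustration, with the NDV assumed to come from the existing Puffin StatisticsFile and the rest from the proposed table-level stats. Spark picks these up through columnStats() on the Statistics object returned by estimateStatistics when CBO and REPORT_COLUMN_STATS are enabled.

import java.util.Optional;
import java.util.OptionalLong;
import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

// Hypothetical holder populated by the scan; all values are placeholders.
class TableLevelColumnStatistics implements ColumnStatistics {
  private final long ndv;        // from the Puffin theta sketch in the StatisticsFile
  private final Object min;      // from the proposed table-level stats file
  private final Object max;
  private final long nullCount;

  TableLevelColumnStatistics(long ndv, Object min, Object max, long nullCount) {
    this.ndv = ndv;
    this.min = min;
    this.max = max;
    this.nullCount = nullCount;
  }

  @Override
  public OptionalLong distinctCount() {
    return OptionalLong.of(ndv);
  }

  @Override
  public Optional<Object> min() {
    return Optional.ofNullable(min);
  }

  @Override
  public Optional<Object> max() {
    return Optional.ofNullable(max);
  }

  @Override
  public OptionalLong nullCount() {
    return OptionalLong.of(nullCount);
  }
}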