I'd like to entertain the idea of deriving min/max values on the fly to understand our baseline. What would the algorithm for that look like? I assume the naive approach would be to keep min/max stats for the selected columns while planning (as opposed to discarding them upon filtering) and then iterate through the FileScanTask objects in memory to aggregate the results?
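For reference, a minimal sketch of that naive approach against the Iceberg Java API might look roughly like the code below. The class and column names are placeholders, it assumes a single top-level primitive column, and the result is only as tight as the per-file bounds (which may be truncated for long values and ignore row-level deletes); includeColumnStats() is what keeps the bounds and null counts on the planned tasks instead of dropping them.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Comparator;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.types.Comparators;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    public class NaiveTableLevelMinMax {

      // Naive baseline: fold per-file lower/upper bounds and null counts from
      // the planned FileScanTasks into table-level min/max/null-count values.
      static void aggregate(Table table, String columnName) throws IOException {
        Types.NestedField field = table.schema().findField(columnName);
        int fieldId = field.fieldId();
        Comparator<Object> cmp = Comparators.forType(field.type().asPrimitiveType());

        Object min = null;
        Object max = null;
        long nullCount = 0L;

        // includeColumnStats() keeps the per-file column stats on the planned
        // tasks instead of discarding them after filtering.
        try (CloseableIterable<FileScanTask> tasks =
            table.newScan().includeColumnStats().planFiles()) {
          for (FileScanTask task : tasks) {
            DataFile file = task.file();

            ByteBuffer lower = file.lowerBounds() == null ? null : file.lowerBounds().get(fieldId);
            if (lower != null) {
              Object value = Conversions.fromByteBuffer(field.type(), lower);
              min = (min == null || cmp.compare(value, min) < 0) ? value : min;
            }

            ByteBuffer upper = file.upperBounds() == null ? null : file.upperBounds().get(fieldId);
            if (upper != null) {
              Object value = Conversions.fromByteBuffer(field.type(), upper);
              max = (max == null || cmp.compare(value, max) > 0) ? value : max;
            }

            Long nulls = file.nullValueCounts() == null ? null : file.nullValueCounts().get(fieldId);
            if (nulls != null) {
              nullCount += nulls;
            }
          }
        }

        System.out.printf("%s: min=%s, max=%s, nullCount=%d%n", columnName, min, max, nullCount);
      }
    }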
- Anton

On Tue, Aug 6, 2024 at 08:33 Gang Wu <ust...@gmail.com> wrote:

> Just giving my two cents. Not all tables have a partition definition, and table-level stats would benefit those tables. In addition, NDV might not be easily populated from partition-level statistics.
>
> Thanks,
> Gang
>
> On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:
>
>> Thanks for raising the discussion Huaxin.
>>
>> I also think partition-level statistics file(s) are more useful and have an advantage over table-level stats. For instance:
>> 1. It would be straightforward to support incremental stats computation for large tables, by recalculating only new or updated partitions.
>> 2. Table-level stats could be built from partition-level stats easily, though merging might take some additional time.
>> 3. A SparkScan usually involves only part of the partitions when a partition filter is applied, so it would be more accurate to infer the stats from the selected partitions only.
>>
>> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>>
>> Thanks for starting this thread Huaxin,
>>
>> The existing statistics, kept on a per-data-file basis, are definitely too granular for use in planning/analysis-time query optimizations. It's worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to be a problem for some.
>>
>> However, I'm not sure that rolling the stats all the way up to the table level is what we want to add next. Would extending the partition stats to include some of the data present for data files be possible? To me, it seems like doing some amount of derivation at query time is okay, as long as the time it takes to do the derivation doesn't increase significantly as the table gets larger.
>>
>> Partition-level stats also have the advantage of being able to provide more accurate estimates for queries with filters on the partition columns.
>>
>> Looking forward to talking more about the project,
>> Alex Jo
>>
>> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>
>>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table-level stats.
>>>
>>> Given that our primary goal in collecting statistics is performance optimization, I believe it's not a good approach to derive these statistics at query execution time. That's why I propose saving these metrics in a table-level stats file. I am thinking of reusing the existing aggregate pushdown mechanism to compute min, max, and null counts. We currently have MinAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>, MaxAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>, CountStar <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>, and CountNotNull <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>. Null counts can be derived from CountStar and CountNotNull.
>>>
>>> Thanks,
>>> Huaxin
>>>
>>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino.
>>>>
>>>> Min, max, and null counts can be derived from manifests. I am not saying that a query engine should derive them from manifests at query time, but it definitely can.
>>>>
>>>> If we want to pull a table-level summary of min, max, and null counts into a stats Puffin file (which probably makes sense), my concern would be what the exact mechanics for that should be. NDVs need to be derived from the data files, but maybe we can do something smarter for min, max, and null counts.
>>>>
>>>> Best,
>>>> Piotr
>>>>
>>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>>
>>>>> Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
>>>>>
>>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>
>>>>>> I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779 .
>>>>>>
>>>>>> In Spark 3.4, I introduced Column Statistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement and report these metrics to Spark. As a result, Spark can utilize the column statistics for cost-based optimization (CBO), including join reordering.
>>>>>>
>>>>>> Here's how the process operates:
>>>>>>
>>>>>> - Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:
>>>>>>
>>>>>> ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>>
>>>>>> This command triggers a ComputeTableStatsSparkAction to compute a data sketch for NDV (number of distinct values) and save it in the StatisticsFile. Here is the ComputeTableStatsAction PR <https://github.com/apache/iceberg/pull/10288>.
>>>>>>
>>>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level min, max, and null counts for columns like col1 and col2 using the manifest files. These table-level statistics are then saved. Currently, we do not have a designated location to store these table-level column statistics. The StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.
>>>>>>
>>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, the NDV will be retrieved from the StatisticsFile, and the min/max/null counts will also be retrieved from a table-level column statistics file. These statistics will be utilized to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then employs the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization (CBO).
>>>>>>
>>>>>> Please share your thoughts on this.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Huaxin
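To make Huaxin's suggestion above (reusing the aggregate pushdown mechanism, with null counts derived from CountStar and CountNotNull) a bit more concrete, here is a minimal sketch of how an ANALYZE-style job could gather the three table-level values with plain aggregates. The table and column names are placeholders, and it assumes aggregate pushdown is applicable so Iceberg can answer the query from manifest metadata via MinAggregate, MaxAggregate, CountStar, and CountNotNull; it is not the proposed implementation itself.

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AnalyzeColumnStatsSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("table-level-column-stats-sketch")
            .getOrCreate();

        // MIN/MAX/COUNT over a column; the null count is simply
        // COUNT(*) - COUNT(col).
        Row stats = spark.sql(
            "SELECT MIN(col1) AS col1_min, "
                + "MAX(col1) AS col1_max, "
                + "COUNT(*) - COUNT(col1) AS col1_null_count "
                + "FROM my_catalog.db.tbl")
            .first();

        // These are the three values the proposal would persist in a
        // table-level column statistics file, next to the NDV sketch that
        // ComputeTableStatsSparkAction already writes to a StatisticsFile.
        System.out.println("min=" + stats.get(0)
            + ", max=" + stats.get(1)
            + ", nullCount=" + stats.getLong(2));

        spark.stop();
      }
    }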
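And on the consumption side, a rough sketch of how the loaded values could be surfaced to Spark's CBO through the DSv2 Statistics and ColumnStatistics interfaces linked above. The class names, the concrete numbers, and the wiring into estimateStatistics are illustrative assumptions, not the actual Iceberg code.

    import java.util.Map;
    import java.util.Optional;
    import java.util.OptionalLong;
    import org.apache.spark.sql.connector.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.NamedReference;
    import org.apache.spark.sql.connector.read.Statistics;
    import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

    // Per-column stats: NDV from the Puffin StatisticsFile, min/max/null count
    // from the proposed table-level column statistics file.
    class ReportedColumnStats implements ColumnStatistics {
      private final long ndv;
      private final Object min;
      private final Object max;
      private final long nullCount;

      ReportedColumnStats(long ndv, Object min, Object max, long nullCount) {
        this.ndv = ndv;
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
      }

      @Override public OptionalLong distinctCount() { return OptionalLong.of(ndv); }
      @Override public Optional<Object> min() { return Optional.ofNullable(min); }
      @Override public Optional<Object> max() { return Optional.ofNullable(max); }
      @Override public OptionalLong nullCount() { return OptionalLong.of(nullCount); }
    }

    // Table-level stats object of the kind estimateStatistics() would return
    // when CBO and REPORT_COLUMN_STATS are enabled.
    class ReportedStatistics implements Statistics {
      private final long numRows;
      private final long sizeInBytes;
      private final Map<NamedReference, ColumnStatistics> columnStats;

      ReportedStatistics(long numRows, long sizeInBytes,
          Map<NamedReference, ColumnStatistics> columnStats) {
        this.numRows = numRows;
        this.sizeInBytes = sizeInBytes;
        this.columnStats = columnStats;
      }

      @Override public OptionalLong numRows() { return OptionalLong.of(numRows); }
      @Override public OptionalLong sizeInBytes() { return OptionalLong.of(sizeInBytes); }
      @Override public Map<NamedReference, ColumnStatistics> columnStats() { return columnStats; }
    }

    class EstimateStatisticsExample {
      // Placeholder values; in the real flow they would be read from the stats files.
      static Statistics example() {
        Map<NamedReference, ColumnStatistics> stats = Map.of(
            Expressions.column("col1"), new ReportedColumnStats(1_000L, 1, 42, 7L));
        return new ReportedStatistics(100_000L, 12_345_678L, stats);
      }
    }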