I'd like to entertain the idea of deriving min/max values on the fly to understand our baseline. What would the algorithm for that look like? I assume the naive approach would be to keep min/max stats for the selected columns while planning (as opposed to discarding them upon filtering) and then iterate through the FileScanTask objects in memory to aggregate the results?
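For reference, a minimal sketch of that naive approach against the Iceberg Java API might look roughly like the code below. The class and column names are placeholders, it assumes a single top-level primitive column, and the result is only as tight as the per-file bounds (which may be truncated for long values and ignore row-level deletes); includeColumnStats() is what keeps the bounds and null counts on the planned tasks instead of dropping them.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Comparator;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;
    import org.apache.iceberg.types.Comparators;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    public class NaiveTableLevelMinMax {

      // Naive baseline: fold per-file lower/upper bounds and null counts from
      // the planned FileScanTasks into table-level min/max/null-count values.
      static void aggregate(Table table, String columnName) throws IOException {
        Types.NestedField field = table.schema().findField(columnName);
        int fieldId = field.fieldId();
        Comparator<Object> cmp = Comparators.forType(field.type().asPrimitiveType());

        Object min = null;
        Object max = null;
        long nullCount = 0L;

        // includeColumnStats() keeps the per-file column stats on the planned
        // tasks instead of discarding them after filtering.
        try (CloseableIterable<FileScanTask> tasks =
            table.newScan().includeColumnStats().planFiles()) {
          for (FileScanTask task : tasks) {
            DataFile file = task.file();

            ByteBuffer lower = file.lowerBounds() == null ? null : file.lowerBounds().get(fieldId);
            if (lower != null) {
              Object value = Conversions.fromByteBuffer(field.type(), lower);
              min = (min == null || cmp.compare(value, min) < 0) ? value : min;
            }

            ByteBuffer upper = file.upperBounds() == null ? null : file.upperBounds().get(fieldId);
            if (upper != null) {
              Object value = Conversions.fromByteBuffer(field.type(), upper);
              max = (max == null || cmp.compare(value, max) > 0) ? value : max;
            }

            Long nulls = file.nullValueCounts() == null ? null : file.nullValueCounts().get(fieldId);
            if (nulls != null) {
              nullCount += nulls;
            }
          }
        }

        System.out.printf("%s: min=%s, max=%s, nullCount=%d%n", columnName, min, max, nullCount);
      }
    }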
- Anton

On Tue, Aug 6, 2024 at 08:33 Gang Wu <ust...@gmail.com> wrote:

> Just giving my two cents. Not all tables have a partition definition, and table-level stats would benefit those tables. In addition, NDV might not be easily populated from partition-level statistics.
>
> Thanks,
> Gang
>
> On Tue, Aug 6, 2024 at 9:48 PM Xianjin YE <xian...@apache.org> wrote:
>
>> Thanks for raising the discussion Huaxin.
>>
>> I also think partition-level statistics file(s) are more useful and have an advantage over table-level stats. For instance:
>> 1. It would be straightforward to support incremental stats computation for large tables, by recalculating only new or updated partitions.
>> 2. Table-level stats could be built from partition-level stats easily, though merging might take some additional time.
>> 3. A SparkScan usually involves only part of the partitions when a partition filter is applied, so it would be more accurate to infer the stats from the selected partitions only.
>>
>> On Aug 6, 2024, at 02:42, Alexander Jo <alex...@starburstdata.com> wrote:
>>
>> Thanks for starting this thread Huaxin,
>>
>> The existing statistics, kept on a per-data-file basis, are definitely too granular for use in planning/analysis-time query optimizations. It's worked so far, as tables have been relatively small, but from what I've seen in the Trino community it is starting to be a problem for some.
>>
>> However, I'm not sure that rolling the stats all the way up to the table level is what we want to add next. Would extending the partition stats to include some of the data present for data files be possible? To me, it seems like doing some amount of derivation at query time is okay, as long as the time it takes to do the derivation doesn't increase significantly as the table gets larger.
>>
>> Partition-level stats also have the advantage of being able to provide more accurate estimates for queries with filters on the partition columns.
>>
>> Looking forward to talking more about the project,
>> Alex Jo
>>
>> On Fri, Aug 2, 2024 at 11:24 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>
>>> Thanks, Samrose and Piotr, for the discussion! This issue is not addressed by the partition statistics feature. What we need are table-level stats.
>>>
>>> Given that our primary goal in collecting statistics is performance optimization, I believe it's not a good approach to derive these statistics at query execution time. That's why I propose saving these metrics in a table-level stats file. I am thinking of reusing the existing aggregate pushdown mechanism to compute min, max, and null counts. We currently have MinAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MinAggregate.java>, MaxAggregate <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/MaxAggregate.java>, CountStar <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountStar.java>, and CountNotNull <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/CountNonNull.java>. Null counts can be derived from CountStar and CountNotNull.
>>>
>>> Thanks,
>>> Huaxin
>>>
>>> On Fri, Aug 2, 2024 at 1:45 PM Piotr Findeisen <piotr.findei...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> First of all, thank you Huaxin for raising this topic. It's important for Spark, but also for Trino.
>>>>
>>>> Min, max, and null counts can be derived from manifests. I am not saying that a query engine should derive them from manifests at query time, but it definitely can.
>>>>
>>>> If we want to pull a table-level summary of min, max, and null counts into a stats Puffin file (which probably makes sense), my concern would be what the exact mechanics for that should be. NDVs need to be derived from the data files, but maybe we can do something smarter for min, max, and null counts.
>>>>
>>>> Best,
>>>> Piotr
>>>>
>>>> On Fri, 2 Aug 2024 at 20:47, Samrose Ahmed <samroseah...@gmail.com> wrote:
>>>>
>>>>> Isn't this addressed by the partition statistics feature, or do you want to have one row for the entire table?
>>>>>
>>>>> On Fri, Aug 2, 2024, 10:47 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>
>>>>>> I would like to initiate a discussion on implementing a table-level statistics file to store column statistics, specifically min, max, and null counts. The original discussion can be found in this Slack thread: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1676395480005779 .
>>>>>>
>>>>>> In Spark 3.4, I introduced Column Statistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/colstats/ColumnStatistics.java#L33> within the Statistics interface <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/Statistics.java#L38>, enabling Iceberg to implement and report these metrics to Spark. As a result, Spark can utilize the column statistics for cost-based optimization (CBO), including join reordering.
>>>>>>
>>>>>> Here's how the process operates:
>>>>>>
>>>>>> - Use the ANALYZE TABLE command to compute column statistics and store them appropriately. The SQL syntax is:
>>>>>>
>>>>>> ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS col1, col2…
>>>>>>
>>>>>> This command triggers a ComputeTableStatsSparkAction to compute a data sketch for NDV (number of distinct values) and save it in the StatisticsFile. Here is the ComputeTableStatsAction PR <https://github.com/apache/iceberg/pull/10288>.
>>>>>>
>>>>>> Additionally, the ANALYZE TABLE command needs to calculate table-level min, max, and null counts for columns like col1 and col2 using the manifest files. These table-level statistics are then saved. Currently, we do not have a designated location to store these table-level column statistics. The StatisticsFile <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/StatisticsFile.java>, being tightly coupled with the Puffin file, does not seem to be a suitable repository for these column stats.
>>>>>>
>>>>>> - If Spark's CBO is on and SparkSQLProperties.REPORT_COLUMN_STATS <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java#L95> is enabled, the NDV will be retrieved from the StatisticsFile, and the min/max/null counts will also be retrieved from a table-level column statistics file. These statistics will be utilized to construct the Statistics object in estimateStatistics <https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L186>. Spark then employs the statistics returned from estimateStatistics <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L86> for cost-based optimization (CBO).
>>>>>>
>>>>>> Please share your thoughts on this.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Huaxin
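To make Huaxin's suggestion above (reusing the aggregate pushdown mechanism, with null counts derived from CountStar and CountNotNull) a bit more concrete, here is a minimal sketch of how an ANALYZE-style job could gather the three table-level values with plain aggregates. The table and column names are placeholders, and it assumes aggregate pushdown is applicable so Iceberg can answer the query from manifest metadata via MinAggregate, MaxAggregate, CountStar, and CountNotNull; it is not the proposed implementation itself.

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AnalyzeColumnStatsSketch {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("table-level-column-stats-sketch")
            .getOrCreate();

        // MIN/MAX/COUNT over a column; the null count is simply
        // COUNT(*) - COUNT(col).
        Row stats = spark.sql(
            "SELECT MIN(col1) AS col1_min, "
                + "MAX(col1) AS col1_max, "
                + "COUNT(*) - COUNT(col1) AS col1_null_count "
                + "FROM my_catalog.db.tbl")
            .first();

        // These are the three values the proposal would persist in a
        // table-level column statistics file, next to the NDV sketch that
        // ComputeTableStatsSparkAction already writes to a StatisticsFile.
        System.out.println("min=" + stats.get(0)
            + ", max=" + stats.get(1)
            + ", nullCount=" + stats.getLong(2));

        spark.stop();
      }
    }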
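And on the consumption side, a rough sketch of how the loaded values could be surfaced to Spark's CBO through the DSv2 Statistics and ColumnStatistics interfaces linked above. The class names, the concrete numbers, and the wiring into estimateStatistics are illustrative assumptions, not the actual Iceberg code.

    import java.util.Map;
    import java.util.Optional;
    import java.util.OptionalLong;
    import org.apache.spark.sql.connector.expressions.Expressions;
    import org.apache.spark.sql.connector.expressions.NamedReference;
    import org.apache.spark.sql.connector.read.Statistics;
    import org.apache.spark.sql.connector.read.colstats.ColumnStatistics;

    // Per-column stats: NDV from the Puffin StatisticsFile, min/max/null count
    // from the proposed table-level column statistics file.
    class ReportedColumnStats implements ColumnStatistics {
      private final long ndv;
      private final Object min;
      private final Object max;
      private final long nullCount;

      ReportedColumnStats(long ndv, Object min, Object max, long nullCount) {
        this.ndv = ndv;
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
      }

      @Override public OptionalLong distinctCount() { return OptionalLong.of(ndv); }
      @Override public Optional<Object> min() { return Optional.ofNullable(min); }
      @Override public Optional<Object> max() { return Optional.ofNullable(max); }
      @Override public OptionalLong nullCount() { return OptionalLong.of(nullCount); }
    }

    // Table-level stats object of the kind estimateStatistics() would return
    // when CBO and REPORT_COLUMN_STATS are enabled.
    class ReportedStatistics implements Statistics {
      private final long numRows;
      private final long sizeInBytes;
      private final Map<NamedReference, ColumnStatistics> columnStats;

      ReportedStatistics(long numRows, long sizeInBytes,
          Map<NamedReference, ColumnStatistics> columnStats) {
        this.numRows = numRows;
        this.sizeInBytes = sizeInBytes;
        this.columnStats = columnStats;
      }

      @Override public OptionalLong numRows() { return OptionalLong.of(numRows); }
      @Override public OptionalLong sizeInBytes() { return OptionalLong.of(sizeInBytes); }
      @Override public Map<NamedReference, ColumnStatistics> columnStats() { return columnStats; }
    }

    class EstimateStatisticsExample {
      // Placeholder values; in the real flow they would be read from the stats files.
      static Statistics example() {
        Map<NamedReference, ColumnStatistics> stats = Map.of(
            Expressions.column("col1"), new ReportedColumnStats(1_000L, 1, 42, 7L));
        return new ReportedStatistics(100_000L, 12_345_678L, stats);
      }
    }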