majian1998 opened a new pull request, #10226: URL: https://github.com/apache/hudi/pull/10226
### Change Logs

HUDI-7110 added a tool for displaying column stats. That tool is hard to read by eye on large tables: with tens of thousands of parquet files, column stats can contain on the order of hundreds of thousands of rows, which is effectively unreadable without further processing in code. An administrator who simply wants to inspect the data layout from column stats needs a more intuitive display.

This PR adds a tool that calculates the overlap degree of column stats per partition and column name. Overlap degree is the extent to which the min-max ranges of different files intersect each other, and it directly affects the effectiveness of data skipping. Snowflake provides a similar concept to aid its clustering process: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions

The implementation is not overly complex. It yields output similar to the following:

| Partition path | Field name | Average overlap | Maximum file overlap | Total file number | 50% overlap | 75% overlap | 95% overlap | 99% overlap | Total value number |
|---|---|---|---|---|---|---|---|---|---|
| path | c8 | 1.33 | 2 | 2 | 1 | 1 | 1 | 1 | 3 |

This gives a straightforward summary of the relevant statistics. For example, consider three files, a.parquet, b.parquet, and c.parquet, and an integer column 'id'. The min-max range of 'id' is 1–5 for 'a', 3–7 for 'b', and 7–8 for 'c', so the ranges overlap at 3–5 (files 'a' and 'b') and at 7 (files 'b' and 'c'). If a filter condition on 'id' touches these values, data skipping cannot narrow the read to a single file: multiple files match the filter and have to be scanned. In the simplest case, an equality query on a value inside these overlapping ranges selects 2 files, while an equality query on any other value selects at most one file (none if the value lies outside every range).
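The a/b/c example above can be checked with a small sketch. This is not the code in this PR: it is a minimal illustration that assumes the overlap depth is sampled at each distinct min/max boundary value, which may differ from how the procedure computes its averages and percentiles. The names `FILE_RANGES`, `files_covering`, and `overlap_summary` are hypothetical.

```python
from typing import Dict, List, Tuple

# Per-file (min, max) ranges for the integer column 'id', copied from the
# example in the description. The dict layout itself is an assumption.
FILE_RANGES: Dict[str, Tuple[int, int]] = {
    "a.parquet": (1, 5),
    "b.parquet": (3, 7),
    "c.parquet": (7, 8),
}


def files_covering(value: int, ranges: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the files whose inclusive [min, max] range contains `value`.

    These are the files an equality filter `id = value` cannot skip.
    """
    return sorted(name for name, (lo, hi) in ranges.items() if lo <= value <= hi)


def overlap_summary(ranges: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Summarize overlap depth sampled at every distinct boundary value.

    Assumption: sampling at min/max boundaries is one simple way to
    approximate overlap; it is not claimed to match the PR's algorithm.
    """
    points = sorted({v for lo_hi in ranges.values() for v in lo_hi})
    depths = [len(files_covering(p, ranges)) for p in points]
    return {
        "average_overlap": sum(depths) / len(depths),
        "maximum_overlap": max(depths),
        "total_files": len(ranges),
        "total_values": len(points),
    }


if __name__ == "__main__":
    print(files_covering(4, FILE_RANGES))  # ['a.parquet', 'b.parquet']
    print(files_covering(7, FILE_RANGES))  # ['b.parquet', 'c.parquet']
    print(files_covering(1, FILE_RANGES))  # ['a.parquet']
    print(overlap_summary(FILE_RANGES))
```

Under this sampling assumption, the boundary values 1, 3, 5, 7, 8 have depths 1, 2, 2, 2, 1, giving an average of 1.6 and a maximum of 2 for the three example files. A lower average means equality and range filters on the column can skip more files.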
TODO: In the future, we hope this work can serve as a foundation for using overlap as a guide for clustering data layout.

### Impact

None

### Risk level (write none, low medium or high below)

None

### Documentation Update

This procedure calculates and displays the overlap degree of column statistics across the files of a table, which is a key factor in evaluating the performance of data skipping strategies.

Parameters and Output Schema

The procedure accepts the following parameters:

- table (StringType, required): The name of the table for which column statistics overlap will be calculated.
- partition (StringType, optional): A specific partition or comma-separated list of partitions to limit the scope of the calculation.
- targetColumns (StringType, optional): A specific column or comma-separated list of columns for which to calculate the statistics.

The output of the procedure is a structured type (StructType) with the following fields, which describe the column statistics overlap for each field within the specified partitions or table:

- Partition path
- Field name
- Average overlap
- Maximum file overlap
- Total file number
- 50% overlap
- 75% overlap
- 95% overlap
- 99% overlap
- Total value number

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed