[PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

via GitHub Thu, 30 Nov 2023 22:39:50 -0800


majian1998 opened a new pull request, #10226:
URL: https://github.com/apache/hudi/pull/10226


   ### Change Logs
   
   In HUDI-7110, a tool was made available to display column stats. 
However, its output is hard to inspect by hand at large data volumes. 
For instance, with tens of thousands of parquet files, column stats can 
contain on the order of hundreds of thousands of rows. That is virtually 
unreadable for humans and has to be processed further by code. An 
administrator who simply wants to observe the data layout directly from 
column stats needs a more intuitive display tool. Here, we offer a tool 
that calculates the overlap degree of column stats per partition and 
column name.
   
   Overlap degree refers to the extent to which the min-max ranges of different 
files intersect with each other. This directly affects the effectiveness of 
data skipping.
   
   In fact, Snowflake provides a similar concept to aid its clustering 
process: 
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions 
Our implementation here is not overly complex.
   
   It yields output similar to the following:
   
   |Partition path |Field name |Average overlap |Maximum file overlap |Total file number |50% overlap |75% overlap |95% overlap |99% overlap |Total value number |
   |---|---|---|---|---|---|---|---|---|---|
   |path |c8 |1.33 |2 |2 |1 |1 |1 |1 |3 |
   
   
   
   This content provides a straightforward representation of the relevant 
statistics.
   
   For example, consider three files: a.parquet, b.parquet, and c.parquet. 
Taking an integer-type column 'id' as an example, the range (min-max) for 'a' 
is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within 
the ranges 3–5 and 7.
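
   The three-file example can be sketched in a few lines of Python. This is 
only an illustration, not the procedure's implementation: the helper names 
`overlap_at` and `overlap_profile` are hypothetical, and evaluating the overlap 
at every distinct min/max endpoint is an assumption about how such a profile 
could be sampled.

```python
from typing import Dict, Tuple

# (min, max) of column 'id' per file, taken from the example above.
id_ranges: Dict[str, Tuple[int, int]] = {
    "a.parquet": (1, 5),
    "b.parquet": (3, 7),
    "c.parquet": (7, 8),
}


def overlap_at(value: int, ranges: Dict[str, Tuple[int, int]]) -> int:
    """Number of files whose inclusive [min, max] range contains `value`."""
    return sum(1 for lo, hi in ranges.values() if lo <= value <= hi)


def overlap_profile(ranges: Dict[str, Tuple[int, int]]) -> Dict[int, int]:
    """Overlap count at every distinct endpoint (assumed sampling points)."""
    points = sorted({p for bounds in ranges.values() for p in bounds})
    return {p: overlap_at(p, ranges) for p in points}


profile = overlap_profile(id_ranges)
average_overlap = sum(profile.values()) / len(profile)
maximum_overlap = max(profile.values())

print(profile)          # {1: 1, 3: 2, 5: 2, 7: 2, 8: 1}
print(average_overlap)  # 1.6
print(maximum_overlap)  # 2
```

   The endpoints 3, 5 and 7 each fall inside two files, which matches the 
overlap regions 3–5 and 7 described above.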
   
   If the filter conditions for 'id' during data skipping include these values, 
more than one file matches and has to be read. In the simplest case, an 
equality query on a value inside these overlapping ranges selects 2 files, 
while a value elsewhere selects at most one file (none if the value lies 
outside every range).
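
   The effect on an equality query can be checked with a small sketch. It 
assumes that data skipping keeps a file whenever the queried value lies within 
the file's inclusive min-max range; the function name `files_to_read` is 
hypothetical and not part of Hudi.

```python
from typing import Dict, List, Tuple

# (min, max) of column 'id' per file, as in the a/b/c example.
file_ranges: Dict[str, Tuple[int, int]] = {
    "a.parquet": (1, 5),
    "b.parquet": (3, 7),
    "c.parquet": (7, 8),
}


def files_to_read(value: int, ranges: Dict[str, Tuple[int, int]]) -> List[str]:
    """Files that cannot be skipped for the predicate `id = value`."""
    return sorted(name for name, (lo, hi) in ranges.items() if lo <= value <= hi)


print(files_to_read(4, file_ranges))  # ['a.parquet', 'b.parquet']
print(files_to_read(7, file_ranges))  # ['b.parquet', 'c.parquet']
print(files_to_read(6, file_ranges))  # ['b.parquet']
print(files_to_read(9, file_ranges))  # []
```

   The lower the overlap, the closer every lookup gets to reading a single file.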
   
   
   TODO: In the future, we hope to build on this foundation and use overlap 
as a guide for clustering data layout.
   
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   This procedure is designed to calculate and display the overlap degree of 
column statistics for different files within a table, which is a key factor in 
evaluating the performance of data skipping strategies.
   
   Parameters and Output Schema
   The procedure accepts the following parameters:
   
   - table (StringType, required): The name of the table for which column 
statistics overlap will be calculated.
   - partition (StringType, optional): A specific partition or comma-separated 
list of partitions to limit the scope of the calculation.
   - targetColumns (StringType, optional): A specific column or comma-separated 
list of columns for which to calculate the statistics.

   The output of the procedure is a structured type (StructType) comprising the 
following fields, which describe various aspects of column statistics overlap 
for each field within the specified partitions or table:
   
   - Partition path
   - Field name
   - Average overlap
   - Maximum file overlap
   - Total file number
   - 50% overlap
   - 75% overlap
   - 95% overlap
   - 99% overlap
   - Total value number
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
