nastra commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3137823304


##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats, which represent statistics as a `struct<struct<...>>`. The statistics for each field are tracked inside a nested struct of value counts and bounds (described in [Field stats types](#field-stats-types)). Each field-level statistics struct is a field of the `content_stats` struct, which holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. The 
value **200** is both the width of each column's stats block and 
`num_reserved_field_ids` from [Reserved field ids](#reserved-field-ids).
+
+**Reserved table field ids**
+
+Columns whose ids fall in the [reserved field ID](#reserved-field-ids) space 
use a different base so their stats ids do not overlap data columns:
+
+`stats_struct_id = 2_147_000_000 + (200 * (200 - (Integer.MAX_VALUE - 
table_field_id)))`
+
+Here `2_147_000_000` is `stats_space_field_id_start_for_metadata_fields`. This 
separate base is required because reserved ids are near `Integer.MAX_VALUE` and 
cannot use the same linear mapping as data field ids.
+
+Valid data field ids support stats structs with ids from `10_000` through 
`200_010_000`, so the highest supported **data** field id is `1_000_000`.
+
+###### Name assignment for `content_stats` fields
+
+Each nested stats struct is a **child field** of the root `content_stats` 
struct. Its **name** is the numerical string of the table column's field id 
(for example id `103` uses the name `"103"`).
+Its **field id** is deterministically calculated as defined in the previous 
section.
+
+###### Field stats types
+
+Each stats struct holds statistics for one table column. It may contain the 
following metrics:
+
+| required/optional | Offset | Name                    | Type                | Description |
+|-------------------|--------|-------------------------|---------------------|-------------|
+| _optional_        | 1      | value_count             | `long`              | Number of values in the column (including null and NaN values) |
+| _optional_        | 2      | null_value_count        | `long`              | Number of null values in the column. Only included for optional columns |
+| _optional_        | 3      | nan_value_count         | `long`              | Number of NaN values in the column. Only included for float/double types. NaN rules follow note 2 under [Data File Fields](#data-file-fields) |
+| _optional_        | 4      | avg_value_size_in_bytes | `int`               | Avg stored (compressed, encoded) value size in bytes for variable-length types (`string` / `binary`) |
+| _optional_        | 5      | max_value_size_in_bytes | `int`               | Max stored (compressed, encoded) value size in bytes for variable-length types (`string` / `binary`) |
+| _optional_        | 6      | lower_bound             | type of table field | Lower bound serialized as the column's type. Bounds follow rules defined in [Bounds for Variant, Geometry, and Geography](#bounds-for-variant-geometry-and-geography) |
+| _optional_        | 7      | upper_bound             | type of table field | Upper bound serialized as the column's type. Bounds follow rules defined in [Bounds for Variant, Geometry, and Geography](#bounds-for-variant-geometry-and-geography) |
+| _optional_        | 8      | exact_bounds            | `boolean`           | Whether the `lower_bound` / `upper_bound` are exact (`true`) or may be truncated or otherwise inexact (`false`). Defaults to `true`. Types such as `string` / `binary` often use `false` when bounds are truncated. For types with inherently exact bounds when written (for example boolean, integer, floating-point, date, time, timestamp, decimal, uuid, `geometry`, `geography`), writers should use `true` when bounds are present. If a deletion vector or equality delete file can match rows in the data file, implementations must treat bounds as inexact for pruning (`exact_bounds` as `false`) |
+
+###### Stats projection
+
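+To retrieve stats for a particular table field ID, readers always project by stats ID, which is calculated from the table field ID as described in [ID assignment for stats fields](#id-assignment-for-stats-fields). Conversely, the table field ID for a given stats struct ID can be recovered by applying the reverse calculation.
+For data columns, the reverse calculation is: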
+
+`table_field_id = (stats_struct_id - 10_000) / 200`
+
+For [reserved field IDs](#reserved-field-ids), the reverse calculation would 
be:
+
+`table_field_id = Integer.MAX_VALUE - num_reserved_field_ids + (stats_struct_id - stats_space_field_id_start_for_metadata_fields) / num_supported_stats_per_column`
+
+using `num_reserved_field_ids = 200`, 
`stats_space_field_id_start_for_metadata_fields = 2_147_000_000`, and 
`num_supported_stats_per_column = 200` (see [ID assignment for stats 
fields](#id-assignment-for-stats-fields)).
+
+Below are examples for some table field ID -> stats struct id calculations.
+
+| Table Field ID      | Stats ID of Stats struct  |
+|---------------------|---------------------------|
+| 0                   | 10_000                    |
+| 1                   | 10_200                    |
+| 2                   | 10_400                    |
+| 5                   | 11_000                    |
+| 100                 | 30_000                    |
+| 1_000_000           | 200_010_000               |
+
+| Reserved Field ID   | Stats ID of Stats struct  |
+|---------------------|---------------------------|
+| 2_147_483_447       | 2_147_000_000             |
+| 2_147_483_448       | 2_147_000_200             |
+| 2_147_483_541       | 2_147_018_800             |
+| 2_147_483_645       | 2_147_039_600             |
+| 2_147_483_646       | 2_147_039_800             |
+
+The table below shows the stats IDs of individual field statistics, which are calculated from the offsets described in the [Field stats types section](#field-stats-types).
+
+| Table Field ID | Stats ID of Stats struct | Stats Type              | Stats ID of individual statistic |
+|----------------|--------------------------|-------------------------|----------------------------------|
+| 2              | 10_400                   | value_count             | 10_401                           |
+|                |                          | null_value_count        | 10_402                           |
+|                |                          | nan_value_count         | 10_403                           |
+|                |                          | avg_value_size_in_bytes | 10_404                           |
+|                |                          | max_value_size_in_bytes | 10_405                           |
+|                |                          | lower_bound             | 10_406                           |
+|                |                          | upper_bound             | 10_407                           |
+|                |                          | exact_bounds            | 10_408                           |
+| 5              | 11_000                   | value_count             | 11_001                           |
+|                |                          | null_value_count        | 11_002                           |
+|                |                          | nan_value_count         | 11_003                           |
+|                |                          | avg_value_size_in_bytes | 11_004                           |
+|                |                          | max_value_size_in_bytes | 11_005                           |
+|                |                          | lower_bound             | 11_006                           |
+|                |                          | upper_bound             | 11_007                           |
+|                |                          | exact_bounds            | 11_008                           |
+
+###### Manifest schema and `content_stats` typing

Review Comment:
   good point, I'll add those to this section
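
   For reference while reviewing, the ID mapping in the hunk above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, `INT_MAX` stands in for Java's `Integer.MAX_VALUE`, `/` in the spec formulas is read as integer division, and the reserved range is assumed to start at `Integer.MAX_VALUE - 200` because that is the first row of the reserved-ID example table.

```python
# Constants quoted from the spec text in the hunk; helper names are hypothetical.
INT_MAX = 2**31 - 1                    # Java's Integer.MAX_VALUE
NUM_SUPPORTED_STATS_PER_COLUMN = 200   # width of each column's stats block
NUM_RESERVED_FIELD_IDS = 200
DATA_START = 10_000                    # stats_space_field_id_start_for_data_fields
METADATA_START = 2_147_000_000         # stats_space_field_id_start_for_metadata_fields
MAX_DATA_FIELD_ID = 1_000_000          # highest supported data field id

# Offsets from the "Field stats types" table.
OFFSETS = {
    "value_count": 1,
    "null_value_count": 2,
    "nan_value_count": 3,
    "avg_value_size_in_bytes": 4,
    "max_value_size_in_bytes": 5,
    "lower_bound": 6,
    "upper_bound": 7,
    "exact_bounds": 8,
}


def stats_struct_id(table_field_id: int) -> int:
    """Map a table field id to the id of its stats struct."""
    if 0 <= table_field_id <= MAX_DATA_FIELD_ID:
        return DATA_START + NUM_SUPPORTED_STATS_PER_COLUMN * table_field_id
    # Assumption: reserved ids are the ids from INT_MAX - 200 upward.
    if INT_MAX - NUM_RESERVED_FIELD_IDS <= table_field_id <= INT_MAX:
        block = NUM_RESERVED_FIELD_IDS - (INT_MAX - table_field_id)
        return METADATA_START + NUM_SUPPORTED_STATS_PER_COLUMN * block
    raise ValueError(f"field id {table_field_id} has no stats id in this sketch")


def stats_field_id(table_field_id: int, stat_name: str) -> int:
    """Id of one statistic: the struct id plus the statistic's fixed offset."""
    return stats_struct_id(table_field_id) + OFFSETS[stat_name]


def table_field_id_for(struct_id: int) -> int:
    """Reverse calculation: recover the table field id from a stats struct id."""
    if struct_id >= METADATA_START:
        block = (struct_id - METADATA_START) // NUM_SUPPORTED_STATS_PER_COLUMN
        return INT_MAX - NUM_RESERVED_FIELD_IDS + block
    return (struct_id - DATA_START) // NUM_SUPPORTED_STATS_PER_COLUMN
```

   With these definitions, `stats_struct_id(2)` gives `10_400` and `stats_field_id(2, "lower_bound")` gives `10_406`, which matches the example tables in the hunk. Because the reverse function uses integer division, it also maps the id of an individual statistic back to its table field id.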



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

