stevenzwu commented on code in PR #14234:
URL: https://github.com/apache/iceberg/pull/14234#discussion_r3126062888


##########
format/spec.md:
##########
@@ -683,6 +683,13 @@ Notes:
 5. The `content_offset` and `content_size_in_bytes` fields are used to 
reference a specific blob for direct access to a deletion vector. For deletion 
vectors, these values are required and must exactly match the `offset` and 
`length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.
 
+###### File-level column statistics

Review Comment:
   Do we need to add update the fields table starting in line 646? add a new 
column for v4, add the `content_stats` struct field, mark the map stats fields 
as deprecated in v4?



##########
format/spec.md:
##########
@@ -683,6 +683,13 @@ Notes:
 5. The `content_offset` and `content_size_in_bytes` fields are used to 
reference a specific blob for direct access to a deletion vector. For deletion 
vectors, these values are required and must exactly match the `offset` and 
`length` stored in the Puffin footer for the deletion vector blob.
 6. The following field ids are reserved on `data_file`: 141.
 
+###### File-level column statistics
+
+Per-column metrics used for filtering and planning are stored at **file** 
granularity on the `data_file` struct.
+In v3, implementations use maps such as `value_counts`, `lower_bounds`, and 
`upper_bounds`, keyed by column id, with bounds serialized as binary (see note 
1 under [Data File Fields](#data-file-fields)).
+Iceberg v4 adds the optional `content_stats` struct, which holds the same 
*logical* metrics for primitive leaf columns using nested structs and typed 
bounds (see [Content Stats](#content-stats)).
+The subsections below define shared bound serialization for variant, geometry, 
and geography, and then define v4 `content_stats`; the same bound rules apply 
whether values are read from v3 maps or from v4 field stats.

Review Comment:
   not sure if this paragraph is needed



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.

Review Comment:
   > Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`
   
   a nit suggestion
   ```
   In v4, column stats are represented in a `struct<struct<...>>`.
   ```



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.

Review Comment:
   This sentence is a bit hard to read for me. Maybe sth like
   
   ```
   Each metric listed in [Field stats types](...) has a fixed offset. Its field 
ID is the enclosing stats struct's ID plus that offset.
   ```



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.

Review Comment:
   nit: `a deterministic transform that maps` -> `a deterministic mapping`.
   
   Also the second sentence `For a given ...` seems to be saying the same thing 
as the first setence.



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**

Review Comment:
   just `Data fields` as section title?



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:

Review Comment:
   nit: is `data_field_id` slightly better than `table_field_id`?
   
   `num_supported_stats_per_column` -> `max_supported_stats_per_column`



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. The 
value **200** is both the width of each column's stats block and 
`num_reserved_field_ids` from [Reserved field ids](#reserved-field-ids).
+
+**Reserved table field ids.**
+
+Columns whose ids fall in the [reserved field ID](#reserved-field-ids) space 
use a different base so their stats ids do not overlap data columns:
+
+`stats_struct_id = 2_147_000_000 + (200 * (200 - (Integer.MAX_VALUE - 
table_field_id)))`

Review Comment:
   `table_field_id` should be `reserved_field_id`?



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. The 
value **200** is both the width of each column's stats block and 
`num_reserved_field_ids` from [Reserved field ids](#reserved-field-ids).
+
+**Reserved table field ids.**
+
+Columns whose ids fall in the [reserved field ID](#reserved-field-ids) space 
use a different base so their stats ids do not overlap data columns:
+
+`stats_struct_id = 2_147_000_000 + (200 * (200 - (Integer.MAX_VALUE - 
table_field_id)))`
+
+Here `2_147_000_000` is `stats_space_field_id_start_for_metadata_fields`. This 
separate base is required because reserved ids are near `Integer.MAX_VALUE` and 
cannot use the same linear mapping as data field ids.
+
+Valid data field ids support stats structs with ids from `10_000` through 
`200_010_000`, so the highest supported **data** field id is `1_000_000`.
+
+###### Name assignment for `content_stats` fields
+
+Each nested stats struct is a **child field** of the root `content_stats` 
struct. Its **name** is the numerical string of the table column's field id 
(for example id `103` uses the name `"103"`).
+Its **field id** is deterministically calculated as defined in the previous 
section.
+
+###### Field stats types
+
+Each stats struct holds statistics for one table column. It may contain the 
following metrics:
+
+| required/optional | Offset | Name                    | Type                | 
Description                                                                     
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                             |
+|-------------------|--------|-------------------------|---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| _optional_        | 1      | value_count             | `long`              | 
Number of values in the column (including null and NaN values)                  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                             |
+| _optional_        | 2      | null_value_count        | `long`              | 
Number of null values in the column. Only included for optional columns         
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                             |
+| _optional_        | 3      | nan_value_count         | `long`              | 
Number of NaN values in the column. Only included for float/double types. NaN 
rules follow note 2 under [Data File Fields](#data-file-fields)                 
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                               |
+| _optional_        | 4      | avg_value_size_in_bytes | `int`               | 
Avg stored (compressed, encoded) value size in bytes for variable-length types 
(`string` / `binary`)                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                              |
+| _optional_        | 5      | max_value_size_in_bytes | `int`               | 
Max stored (compressed, encoded) value size in bytes for variable-length types 
(`string` / `binary`)                                                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                              |
+| _optional_        | 6      | lower_bound             | type of table field | 
Lower bound serialized as the column's type. Bounds follow rules defined in 
[Bounds for Variant, Geometry, and 
Geography](#bounds-for-variant-geometry-and-geography)                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                              |
+| _optional_        | 7      | upper_bound             | type of table field | 
Upper bound serialized as the column's type. Bounds follow rules defined in 
[Bounds for Variant, Geometry, and 
Geography](#bounds-for-variant-geometry-and-geography)                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                              |
+| _optional_        | 8      | exact_bounds            | `boolean`           | 
Whether the `lower_bound` / `upper_bound` are exact (`true`) or may be 
truncated or otherwise inexact (`false`). Defaults to `true`. Types such as 
`string` / `binary` often use `false` when bounds are truncated. For types with 
inherently exact bounds when written (for example boolean, integer, 
floating-point, date, time, timestamp, decimal, uuid, `geometry`, `geography`), 
writers should use `true` when bounds are present. If a deletion vector or 
equality delete file can match rows in the data file, implementations must 
treat bounds as inexact for pruning (`exact_bounds` as `false`) |
+
+###### Stats projection
+
+To retrieve stats for a particular table field ID, one would always project by 
stats ID, where the stats ID for a given table field ID can be calculated by 
applying the reverse calculation.
+For data columns the reverse calculation would be:
+
+`table_field_id = (stats_struct_id - 10_000) / 200`
+
+For [reserved field IDs](#reserved-field-ids), the reverse calculation would 
be:
+
+`table_field_id = stats_struct_id - num_reserved_field_ids + 
(Integer.MAX_VALUE - stats_struct_id) + (stats_struct_id - 
stats_space_field_id_start_for_metadata_fields) / 
num_supported_stats_per_column`

Review Comment:
   inconsistent with line 773 above. plug in the actual numbers here?



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. The 
value **200** is both the width of each column's stats block and 
`num_reserved_field_ids` from [Reserved field ids](#reserved-field-ids).

Review Comment:
   it doesn't seem necessary to define 
`stats_space_field_id_start_for_data_fields` , as it is not referenced later.
   
   The second sentence may be simplified a bit. e.g.
   ```
   max_supported_stats_per_column (200) applies to both data fields and 
[reserved metadata fields]()
   ```



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. The 
value **200** is both the width of each column's stats block and 
`num_reserved_field_ids` from [Reserved field ids](#reserved-field-ids).
+
+**Reserved table field ids.**

Review Comment:
   nit: `Reserved Metadata Fields` for section title?



##########
format/spec.md:
##########
@@ -707,6 +714,119 @@ For `geography` only, xmin (X value of `lower_bounds`) 
may be greater than xmax
 
 When calculating upper and lower bounds for `geometry` and `geography`, null 
or NaN values in a coordinate dimension are skipped; for example, POINT (1 NaN) 
contributes a value to X but no values to Y, Z, or M dimension bounds. If a 
dimension has only null or NaN values, that dimension is omitted from the 
bounding box. If either the X or Y dimension is missing then the bounding box 
itself is not produced.
 
+##### Content Stats
+
+Iceberg v4 introduces content stats which represent stats in a 
`struct<struct<...>>`. The statistics for fields are tracked inside a nested 
struct of value counts and bounds (described in the next section). Each 
field-level statistics struct is a field of the `content_stats` struct, which 
holds all statistics for table fields.
+
+###### ID assignment for stats fields
+
+ID assignment follows a deterministic transform that maps from the **table ID 
space** to the **metadata ID space**. For a given field ID from the **table ID 
space** each nested stats struct gets an ID assigned from the **metadata ID 
space**.
+The offset defined in the [field stats types section](#field-stats-types) is 
added to the stats ID of the enclosing stats struct to calculate IDs for each 
individual field stats type.
+
+**Data columns (normal table field ids)**
+
+Let `table_field_id` be the column's id in the table schema. Allocate a 
contiguous block of **200** ids per column (`num_supported_stats_per_column = 
200`). The stats struct for that column starts at:
+
+`stats_struct_id = 10_000 + (200 * table_field_id)`
+
+Each field statistic listed under [Field stats types](#field-stats-types) has 
a fixed **offset** within that block. The field id for an individual field 
statistic is:
+
+`stats_field_id = stats_struct_id + offset`
+
+The constant `10_000` is `stats_space_field_id_start_for_data_fields`. The 
value **200** is both the width of each column's stats block and 
`num_reserved_field_ids` from [Reserved field ids](#reserved-field-ids).
+
+**Reserved table field ids.**
+
+Columns whose ids fall in the [reserved field ID](#reserved-field-ids) space 
use a different base so their stats ids do not overlap data columns:

Review Comment:
   nit: slight rewording
   ```
   [Reserved metadata fields](...) use a different starting base for their 
stats field ids that are not allowed to overlap with data field stats ids.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to