laichunpongben opened a new issue, #2362:
URL: https://github.com/apache/iceberg-rust/issues/2362

   ## Summary
   
   The write path (including `ParquetWriter::parquet_files_to_data_files` and 
`ParquetWriter::close`) currently stores the full, untruncated values of a 
column's minimum and maximum in the manifest's `lower_bounds` / `upper_bounds` 
maps. The Iceberg spec (and every other official implementation — Java, Python, 
pyiceberg) applies a configurable truncation step at manifest serialization 
time, controlled by the `write.metadata.metrics.*` table properties, with 
`truncate(16)` as the default for binary and string columns.
   
   iceberg-rust has no write-side metrics-mode framework at all: neither the 
constants, the `MetricsMode` parser, nor the truncation helpers exist, and 
`MinMaxColAggregator::produce()` is not aware of table properties.
   
   This is strictly a write-path gap. The read path already understands 
truncated bounds: `inclusive_metrics_evaluator.rs` and 
`page_index_evaluator.rs` both truncate filter datums when comparing against 
possibly-truncated manifest bounds, so reads against tables written by other 
implementations work correctly. Tables written by iceberg-rust are also read 
correctly by pyiceberg/Java, because untruncated bounds are more precise and 
still valid. So this is not a correctness regression — but it is a 
spec-conformance and manifest-size gap.
   
   ## Reproduction (against a REST catalog + S3)
   
   Using the `dev/docker-compose.yaml` fixture:
   
   1. Create a partitioned table with a `string` column carrying, say, 32-byte 
values.
   2. Write a Parquet file where the column's min/max is the same 32-byte 
string.
   3. Convert to `DataFile` via `ParquetWriter::parquet_files_to_data_files`.
   4. Inspect the resulting `DataFile.lower_bounds` / `.upper_bounds` for the 
string field:
      - **iceberg-rust (current)**: both entries store all 32 bytes verbatim.
      - **pyiceberg (reference)**: `lower_bounds` stores the first 16 bytes; 
`upper_bounds` stores the first 16 bytes with the final codepoint incremented 
(so the truncated value remains a valid upper bound).
   
   ## Reference implementation (pyiceberg, Apache-2.0)
   
   ### Property constants (`pyiceberg/table/__init__.py`)
   
   ```python
   class TableProperties:
       DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default"
       DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)"
       METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column"
   ```
   
   ### Mode parser (`pyiceberg/io/pyarrow.py`)
   
   ```python
   class MetricModeTypes(Enum):
       TRUNCATE = "truncate"
       NONE = "none"
       COUNTS = "counts"
       FULL = "full"
   ```
   
   ### Upper-bound increment (`pyiceberg/utils/truncate.py`)
   
   ```python
   def truncate_upper_bound_text_string(value: str, trunc_length: int | None) 
-> str | None:
       result = value[:trunc_length]
       if result != value:
           chars = [*result]
           for i in range(-1, -len(result) - 1, -1):
               try:
                   chars[i] = chr(ord(chars[i]) + 1)
                   return "".join(chars)
               except ValueError:
                   pass  # this codepoint was at max; try the previous position
           return None
       return result
   ```
   
   Binary variant is the same structure, operating on bytes with `< 255` check.
   
   ### Per-column override
   
   Each primitive field is visited; the effective mode is 
`write.metadata.metrics.column.<name>` if set, else the default. 
Non-string/binary columns with mode `truncate` auto-upgrade to `full`; nested 
columns downgrade to `counts`.
   
   ## Proposed scope
   
   Minimum viable change that the maintainers might accept in one PR:
   
   1. **Constants** — add to `spec/table_properties.rs`:
      - `PROPERTY_WRITE_METRICS_MODE` / `..._DEFAULT = "truncate(16)"`
      - `PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX`
   
   2. **Type** — `MetricsMode` enum + a `match_metrics_mode(s: &str) -> 
Result<MetricsMode>` that parses `truncate(N)`, `none`, `counts`, `full`.
   
   3. **Plan** — a `compute_metrics_plan(schema: &SchemaRef, properties: 
&HashMap<String, String>) -> HashMap<i32, MetricsMode>` mirroring pyiceberg's 
`compute_statistics_plan` (pre-order visit with nested-downgrade and 
non-text-auto-upgrade).
   
   4. **Threading** — pass `TableMetadata` (or the precomputed plan) into 
`parquet_to_data_file_builder` so the aggregator can see it.
   
   5. **Aggregator** — `MinMaxColAggregator` gains a per-field mode, and its 
bound-producing methods apply `truncate(N)` for strings/binary as described.
   
   6. **Helpers** — port `truncate_upper_bound_text_string` / 
`truncate_upper_bound_binary_string` into a new `spec::truncate_bound` (or 
similar) module. Rust-native UTF-8: iterate by `chars()` rather than bytes for 
the string variant, matching pyiceberg's semantics.
   
   7. **Tests** — unit coverage per mode (`full`, `counts`, `none`, 
`truncate(N)`) × per primitive type; integration test that round-trips a known 
string against pyiceberg's output (the docker-compose fixture makes this 
straightforward).
   
   ## A non-negotiable constraint
   
   `partition_value_from_bounds` (added in #1079, wired into the add-files flow 
by its follow-up) **must continue to see untruncated bounds**. Identity 
partitioning on a string column relies on `min == max` to detect 
single-partition files; if truncation is applied before partition inference, 
two files with, say, `"AAPL-2026-01-15-200-C"` rows would falsely fail with 
`"more than one partition values"` because the truncated-and-incremented upper 
bound differs from the truncated lower bound.
   
   pyiceberg maintains this separation cleanly: `StatsAggregator.current_min` / 
`current_max` are the raw values (consumed by `_partition_value`), and 
truncation is applied only inside `min_as_bytes()` / `max_as_bytes()` on the 
serialization path. Any fix here should follow the same layering — bounds stay 
full-precision in the aggregator's public output; truncation happens at the 
DataFile-serialization boundary.
   
   ## References
   
   - Spec: https://iceberg.apache.org/spec/#metrics-modes
   - pyiceberg reference implementation: `pyiceberg/io/pyarrow.py` 
(`StatsAggregator`, `compute_statistics_plan`) and `pyiceberg/utils/truncate.py`
   - Related work in this repo: #1079 (partition-from-bounds helper), #1035 
(parent issue, closed), #960 (original add-existing-parquet-files PR)
   - Read path already handling truncated bounds: 
`expr/visitors/inclusive_metrics_evaluator.rs`, 
`expr/visitors/page_index_evaluator.rs`
   
   Happy to take this on if there is interest and no duplicate work underway.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to