laichunpongben opened a new issue, #2362:
URL: https://github.com/apache/iceberg-rust/issues/2362
## Summary
The write path (including `ParquetWriter::parquet_files_to_data_files` and
`ParquetWriter::close`) currently stores the full, untruncated values of a
column's minimum and maximum in the manifest's `lower_bounds` / `upper_bounds`
maps. The Iceberg spec (and every other official implementation — Java, Python,
pyiceberg) applies a configurable truncation step at manifest serialization
time, controlled by the `write.metadata.metrics.*` table properties, with
`truncate(16)` as the default for binary and string columns.
iceberg-rust has no write-side metrics-mode framework at all: neither the
constants, the `MetricsMode` parser, nor the truncation helpers exist, and
`MinMaxColAggregator::produce()` is not aware of table properties.
This is strictly a write-path gap. The read path already understands
truncated bounds: `inclusive_metrics_evaluator.rs` and
`page_index_evaluator.rs` both truncate filter datums when comparing against
possibly-truncated manifest bounds, so reads against tables written by other
implementations work correctly. Tables written by iceberg-rust are also read
correctly by pyiceberg/Java, because untruncated bounds are more precise and
still valid. So this is not a correctness regression — but it is a
spec-conformance and manifest-size gap.
## Reproduction (against a REST catalog + S3)
Using the `dev/docker-compose.yaml` fixture:
1. Create a partitioned table with a `string` column carrying, say, 32-byte
values.
2. Write a Parquet file where the column's min/max is the same 32-byte
string.
3. Convert to `DataFile` via `ParquetWriter::parquet_files_to_data_files`.
4. Inspect the resulting `DataFile.lower_bounds` / `.upper_bounds` for the
string field:
- **iceberg-rust (current)**: both entries store all 32 bytes verbatim.
- **pyiceberg (reference)**: `lower_bounds` stores the first 16 bytes;
`upper_bounds` stores the first 16 bytes with the final codepoint incremented
(so the truncated value remains a valid upper bound).
## Reference implementation (pyiceberg, Apache-2.0)
### Property constants (`pyiceberg/table/__init__.py`)
```python
class TableProperties:
DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default"
DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)"
METRICS_MODE_COLUMN_CONF_PREFIX = "write.metadata.metrics.column"
```
### Mode parser (`pyiceberg/io/pyarrow.py`)
```python
class MetricModeTypes(Enum):
TRUNCATE = "truncate"
NONE = "none"
COUNTS = "counts"
FULL = "full"
```
### Upper-bound increment (`pyiceberg/utils/truncate.py`)
```python
def truncate_upper_bound_text_string(value: str, trunc_length: int | None)
-> str | None:
result = value[:trunc_length]
if result != value:
chars = [*result]
for i in range(-1, -len(result) - 1, -1):
try:
chars[i] = chr(ord(chars[i]) + 1)
return "".join(chars)
except ValueError:
pass # this codepoint was at max; try the previous position
return None
return result
```
Binary variant is the same structure, operating on bytes with `< 255` check.
### Per-column override
Each primitive field is visited; the effective mode is
`write.metadata.metrics.column.<name>` if set, else the default.
Non-string/binary columns with mode `truncate` auto-upgrade to `full`; nested
columns downgrade to `counts`.
## Proposed scope
Minimum viable change that the maintainers might accept in one PR:
1. **Constants** — add to `spec/table_properties.rs`:
- `PROPERTY_WRITE_METRICS_MODE` / `..._DEFAULT = "truncate(16)"`
- `PROPERTY_METRICS_MODE_COLUMN_CONF_PREFIX`
2. **Type** — `MetricsMode` enum + a `match_metrics_mode(s: &str) ->
Result<MetricsMode>` that parses `truncate(N)`, `none`, `counts`, `full`.
3. **Plan** — a `compute_metrics_plan(schema: &SchemaRef, properties:
&HashMap<String, String>) -> HashMap<i32, MetricsMode>` mirroring pyiceberg's
`compute_statistics_plan` (pre-order visit with nested-downgrade and
non-text-auto-upgrade).
4. **Threading** — pass `TableMetadata` (or the precomputed plan) into
`parquet_to_data_file_builder` so the aggregator can see it.
5. **Aggregator** — `MinMaxColAggregator` gains a per-field mode, and its
bound-producing methods apply `truncate(N)` for strings/binary as described.
6. **Helpers** — port `truncate_upper_bound_text_string` /
`truncate_upper_bound_binary_string` into a new `spec::truncate_bound` (or
similar) module. Rust-native UTF-8: iterate by `chars()` rather than bytes for
the string variant, matching pyiceberg's semantics.
7. **Tests** — unit coverage per mode (`full`, `counts`, `none`,
`truncate(N)`) × per primitive type; integration test that round-trips a known
string against pyiceberg's output (the docker-compose fixture makes this
straightforward).
## A non-negotiable constraint
`partition_value_from_bounds` (added in #1079, wired into the add-files flow
by its follow-up) **must continue to see untruncated bounds**. Identity
partitioning on a string column relies on `min == max` to detect
single-partition files; if truncation is applied before partition inference,
two files with, say, `"AAPL-2026-01-15-200-C"` rows would falsely fail with
`"more than one partition values"` because the truncated-and-incremented upper
bound differs from the truncated lower bound.
pyiceberg maintains this separation cleanly: `StatsAggregator.current_min` /
`current_max` are the raw values (consumed by `_partition_value`), and
truncation is applied only inside `min_as_bytes()` / `max_as_bytes()` on the
serialization path. Any fix here should follow the same layering — bounds stay
full-precision in the aggregator's public output; truncation happens at the
DataFile-serialization boundary.
## References
- Spec: https://iceberg.apache.org/spec/#metrics-modes
- pyiceberg reference implementation: `pyiceberg/io/pyarrow.py`
(`StatsAggregator`, `compute_statistics_plan`) and `pyiceberg/utils/truncate.py`
- Related work in this repo: #1079 (partition-from-bounds helper), #1035
(parent issue, closed), #960 (original add-existing-parquet-files PR)
- Read path already handling truncated bounds:
`expr/visitors/inclusive_metrics_evaluator.rs`,
`expr/visitors/page_index_evaluator.rs`
Happy to take this on if there is interest and no duplicate work underway.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]