CTTY commented on code in PR #2384:
URL: https://github.com/apache/iceberg-rust/pull/2384#discussion_r3164586355

##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements. See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership. The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License. You may obtain a copy of the License at
+  ~
+  ~ http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied. See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core library of the Apache Iceberg Rust project. Its module layout in `crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, `writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: traits for physical file writers, generic over an output type (defaulting to `Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for logical Iceberg writers (data files, equality deletes, position deletes, partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in `writer/file_writer/parquet_writer.rs` are the only `FileWriter` implementation. Higher-level writers such as `DataFileWriterBuilder` (`writer/base_writer/data_file_writer.rs`) and `EqualityDeleteFileWriterBuilder` (`writer/base_writer/equality_delete_writer.rs`) are generic over any `FileWriterBuilder`, but every example, test, and integration instantiates them with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and `ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses Parquet-specific row group filtering and page index logic. `TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` field, but `process_file_scan_task` ignores it. `DataFileFormat` appears throughout `src/`. Almost every non-test reference is `DataFileFormat::Parquet`. The only non-Parquet references are two uses of `DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: `Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and deletion vectors rather than row data. They are handled in `crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It is the input type for every writer trait and the output type for every reader. The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, `arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and constant column injection, and the Parquet-to-Arrow read path. Every integration crate, including `iceberg-datafusion`, consumes Arrow. The crate does not define a generic `Record` type and does not integrate with engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing `ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options (`with_metadata_size_hint`, `with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, value counts, and min/max bounds through `MinMaxColAggregator` and `NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. Another format cannot produce comparable manifest metadata without a shared statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant and Geometry types. Each format encodes them differently: Parquet uses variant shredding, ORC uses binary, Avro uses union types. Implementing either type without a format abstraction means new `match` arms in every reader and writer code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs [#15253](https://github.com/apache/iceberg/pull/15253) through [#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink ([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and others). A static `FormatModelRegistry` stores implementations keyed by `Pair<FileFormat, Class<?>>` and populates itself through Java reflection at startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and `ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue [apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100), with an in-progress PR at [apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119). Because PyIceberg uses PyArrow as its only in-memory representation, the proposal drops Java's generic type parameters and keys the registry on file format alone. Two prior PyIceberg ORC PRs ([#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as stale without merging, which reinforces the case for landing an abstraction layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, single-dimension registry key, hard cutover from the old Parquet types) are justified inline where they appear. Specific Java design points that shaped those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type.

Review Comment:
   I think this is more of an implementation detail.

##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+ +## Goals + +The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome. + +1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly. + +3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type. + +4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges. Review Comment: Is this really necessary for the initial implementation? we only support parquet for now TCK is not even completed on the java side afaik ########## docs/rfcs/0003_file_format_api.md: ########## @@ -0,0 +1,520 @@ +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. 
+--> + +# RFC: File Format API for Apache Iceberg Rust + +## Background + +### Current state + +The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core library of the Apache Iceberg Rust project. Its module layout in `crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, `writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today. + +For data file writing, the crate provides a three-layer architecture described in `crates/iceberg/src/writer/mod.rs`: + +1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: traits for physical file writers, generic over an output type (defaulting to `Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`. + +2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for logical Iceberg writers (data files, equality deletes, position deletes, partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output. + +3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in `writer/file_writer/parquet_writer.rs` are the only `FileWriter` implementation. Higher-level writers such as `DataFileWriterBuilder` (`writer/base_writer/data_file_writer.rs`) and `EqualityDeleteFileWriterBuilder` (`writer/base_writer/equality_delete_writer.rs`) are generic over any `FileWriterBuilder`, but every example, test, and integration instantiates them with `ParquetWriterBuilder`. + +The trait layer is format-agnostic. Every concrete instantiation uses Parquet. + +For data file reading, the crate provides `ArrowReaderBuilder` and `ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the generic name. 
`ArrowReader::process_file_scan_task` calls `open_parquet_file` directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses Parquet-specific row group filtering and page index logic. `TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` field, but `process_file_scan_task` ignores it. `DataFileFormat` appears throughout `src/`. Almost every non-test reference is `DataFileFormat::Parquet`. The only non-Parquet references are two uses of `DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files. + +The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: `Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and deletion vectors rather than row data. They are handled in `crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). Data file support today: + +| Format | Data file read | Data file write | Manifests | +|---------|----------------|-----------------|-----------| +| Parquet | Yes | Yes | No | +| Avro | No | No | Yes | +| ORC | No | No | No | + +A table containing ORC or Avro data files cannot be read from Rust today, even though both are valid per the Iceberg spec. + +Arrow `RecordBatch` is the only in-memory data representation in the crate. It is the input type for every writer trait and the output type for every reader. The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, `arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and constant column injection, and the Parquet-to-Arrow read path. Every integration crate, including `iceberg-datafusion`, consumes Arrow. The crate does not define a generic `Record` type and does not integrate with engine-specific row types such as Java's `InternalRow` or `RowData`. + +### Pain points + +1. 
**No extension point for new formats.** Adding ORC means editing `ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic through every layer that touches it. The write path has the same shape. + +2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options (`with_metadata_size_hint`, `with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory representation (Arrow) with the on-disk format (Parquet). + +3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, value counts, and min/max bounds through `MinMaxColAggregator` and `NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. Another format cannot produce comparable manifest metadata without a shared statistics interface. + +4. **V3 types will need per-format serialization.** The V3 spec adds Variant and Geometry types. Each format encodes them differently: Parquet uses variant shredding, ORC uses binary, Avro uses union types. Implementing either type without a format abstraction means new `match` arms in every reader and writer code path. + +### Prior work + +The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs [#15253](https://github.com/apache/iceberg/pull/15253) through [#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink ([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after. + +Java's `FormatModel` carries two generic parameters: a data type `D` (Java uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and others). 
A static `FormatModelRegistry` stores implementations keyed by `Pair<FileFormat, Class<?>>` and populates itself through Java reflection at startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and `ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations wrap them rather than replace them. + +PyIceberg has an open proposal for the equivalent capability in issue [apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100), with an in-progress PR at [apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119). Because PyIceberg uses PyArrow as its only in-memory representation, the proposal drops Java's generic type parameters and keys the registry on file format alone. Two prior PyIceberg ORC PRs ([#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as stale without merging, which reinforces the case for landing an abstraction layer before adding new formats. + +Design decisions in this RFC that differ from Java (no generics, single-dimension registry key, hard cutover from the old Parquet types) are justified inline where they appear. Specific Java design points that shaped those decisions are called out in Alternatives Considered. + +## Goals + +The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome. + +1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly. + +3. 
Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type. + +4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges. + +5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline. + +## Non-Goals + +The items below are deliberately out of scope to keep this proposal focused on the abstraction and its Parquet implementation. Most are follow-up work that the API enables but does not itself deliver. + +1. **Ship new format implementations.** This RFC lands the abstraction and a Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up RFCs. + +2. **Introduce a plugin protocol or runtime library loading.** Rust does not offer a clean mechanism for loading compiled plugins at runtime. A runtime-linking approach using `libloading` or similar would expand scope beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, Lance) require. + +3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`. + +4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it. + +5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC. + +6. 
**Change the Iceberg table spec.** This proposal is a Rust-only API change. It does not modify the Iceberg spec, the manifest format, the manifest list format, or the on-disk layout of any file. + +7. **Modify manifest read or write paths.** Manifests and manifest lists remain in Avro and are handled by the existing `ManifestReader` and `ManifestWriter` paths. The File Format API is about data files and delete files only. + + +## Design + +The Rust API is three traits and a registry. `FormatModel` is the trait that each format implementation provides. `FormatReadBuilder` and `FormatWriteBuilder` are the per-operation configurators that `FormatModel` returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances. None of the traits carry generic parameters. The subsections below introduce each type, and a final "Design rationale" subsection explains the choices. + +### The FormatModel trait + +```rust +pub trait FormatModel: Send + Sync + 'static { + fn format(&self) -> DataFileFormat; + fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>; + fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>; +} +``` + +Each implementation registers one instance per `DataFileFormat` variant it supports. The `format` method returns that variant. `read_builder` and `write_builder` are the entry points for reading and writing a file. Both return trait objects so that the registry can hand them back from a `DataFileFormat`-keyed lookup. + +The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to `iceberg::spec::Schema`, and the physical schema type to `arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today that would fill the roles Java uses generic parameters for. Arguments for keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are in "Design rationale" below. Review Comment: We don't have to hardcode it to use Arrow's `RecordBatch` even. 
   We can use a generic type for in-memory representation and arrow can be the default value



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+### The read and write builders
+
+`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. `FormatModel` produces them, and the caller consumes them with `build`.
+
+```rust
+pub trait FormatReadBuilder: Send {
+    fn project(&mut self, schema: Schema) -> &mut Self;
+    fn filter(&mut self, predicate: BoundPredicate) -> &mut Self;
+    fn split(&mut self, start: u64, length: u64) -> &mut Self;
+    fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self;
+    fn batch_size(&mut self, batch_size: usize) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<ArrowRecordBatchStream>>;
+}
+
+pub trait FormatWriteBuilder: Send {
+    fn schema(&mut self, schema: Schema) -> &mut Self;
+    fn set(&mut self, key: String, value: String) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn FormatFileWriter>>>;
+}
+```
+
+Both builders take Iceberg `Schema` values. Format implementations convert to physical schemas internally using `schema_to_arrow_schema` from `arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, which has a separate `engineProjection(S)` method for variant shredding and similar per-engine type mapping. Rust's builders expose one projection surface, and the hook for a variant-shredding "engine projection" is a future-extension point described in "Design rationale."
+
+`FormatWriteBuilder::build` produces a `FormatFileWriter`:
+
+```rust
+pub trait FormatFileWriter: Send {
+    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
+    fn close(self: Box<Self>) -> BoxFuture<'static, Result<Vec<DataFileBuilder>>>;
+}
+```
+
+The async methods return `BoxFuture` rather than using `async fn` in traits. The `self: Box<Self>` signature on `build` and `close` lets those methods consume the value while keeping the traits object-safe. Both patterns are forced by the trait-object boundary at the registry. The "BoxFuture instead of async fn in traits" and "Dynamic dispatch at the registry, static inside the format" subsections under "Design rationale" explain why.
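
The `self: Box<Self>` pattern described in the hunk above can be illustrated with a minimal, synchronous sketch. Everything here is a hypothetical stand-in (`SketchWriteBuilder`, `SketchWriter`, `ParquetSketchBuilder`, `new_builder`), not the RFC's API: it drops `BoxFuture` and the iceberg types so it compiles with the standard library alone. One deliberate deviation: under Rust's dyn-compatibility rules, a method that returns `&mut Self` generally needs a `where Self: Sized` bound before the trait can be used as `dyn Trait`, so the setter in this sketch returns `&mut dyn SketchWriteBuilder` instead. That is a choice made for the sketch, not something the RFC specifies.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the writer a builder produces.
struct SketchWriter {
    props: HashMap<String, String>,
}

// Simplified, synchronous analogue of a write builder used behind `Box<dyn ...>`.
trait SketchWriteBuilder {
    // Returns a trait-object reference so the trait stays usable as `dyn`.
    fn set(&mut self, key: String, value: String) -> &mut dyn SketchWriteBuilder;
    // `self: Box<Self>` consumes the builder and is still callable on a trait object.
    fn build(self: Box<Self>) -> Result<SketchWriter, String>;
}

// Hypothetical format-specific builder.
struct ParquetSketchBuilder {
    props: HashMap<String, String>,
}

impl SketchWriteBuilder for ParquetSketchBuilder {
    fn set(&mut self, key: String, value: String) -> &mut dyn SketchWriteBuilder {
        self.props.insert(key, value);
        self
    }

    fn build(self: Box<Self>) -> Result<SketchWriter, String> {
        // Moving `props` out is possible because the builder is owned here.
        Ok(SketchWriter { props: self.props })
    }
}

// Plays the role of a `write_builder` entry point that hides the concrete type.
fn new_builder() -> Box<dyn SketchWriteBuilder> {
    Box::new(ParquetSketchBuilder { props: HashMap::new() })
}

fn main() {
    let mut builder = new_builder();
    builder
        .set("compression".to_string(), "zstd".to_string())
        .set("page-size".to_string(), "1024".to_string());
    // After `build`, `builder` has been moved and cannot be reused.
    let writer = builder.build().expect("build should succeed");
    println!("writer configured with {} properties", writer.props.len());
}
```

The caller only ever holds a `Box<dyn SketchWriteBuilder>`, yet `build` still takes ownership, which is the property the RFC text attributes to `self: Box<Self>`.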
+
+### The FormatRegistry
+
+`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances.
+
+```rust
+pub struct FormatRegistry {
+    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
+}
+
+impl FormatRegistry {
+    pub fn new() -> Self { ... }
+    pub fn register(&mut self, model: Box<dyn FormatModel>) { ... }
+    pub fn read_builder(
+        &self,
+        format: DataFileFormat,
+        input: InputFile,
+    ) -> Result<Box<dyn FormatReadBuilder>> { ... }
+    pub fn write_builder(
+        &self,
+        format: DataFileFormat,
+        output: OutputFile,
+    ) -> Result<Box<dyn FormatWriteBuilder>> { ... }
+}
+```
+
+Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to the enum. That is a non-breaking addition.
+
+The registry is an owned value, not a global static. Tests construct their own. Applications construct one at startup and pass it to scan planners and write orchestrators. For the common case of a single registry for the lifetime of a process, `default_format_registry()` returns a `&'static FormatRegistry` initialized through `OnceLock` on first call.
+
+`read_builder` and `write_builder` return `Err(Error { kind: ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error message distinguishes two cases: the format is implemented but its feature flag is disabled in this build, or the format has no implementation in this crate.
+
+### Feature flags

Review Comment:
   This could be a non-goal, we only plan to support parquet for now and parquet is essential



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+### Feature flags
+
+Format implementations live in `iceberg::formats::{format}` and are gated behind a feature flag per format: `format-parquet` on by default, `format-orc` when an ORC implementation lands, and so on. The default feature set includes every format the crate implements. Users who build from source and want a smaller binary disable what they do not need.
+
+`FormatRegistry::default()` registers every format enabled by the current feature set at compile time:
+
+```rust
+impl Default for FormatRegistry {
+    fn default() -> Self {
+        let mut registry = Self::new();
+        #[cfg(feature = "format-parquet")]
+        registry.register(Box::new(ParquetFormatModel::new()));
+        #[cfg(feature = "format-orc")]
+        registry.register(Box::new(OrcFormatModel::new()));
+        registry
+    }
+}
+```
+
+Format implementations live inside the `iceberg` crate rather than as separate dependencies. This matches how the Java project keeps its `Parquet`, `Avro`, and `ORC` modules inside the `apache/iceberg` repository. In-tree implementations let the community control each format's quality and release cadence and keep the build graph narrow for downstream consumers. Nothing in the trait design prevents a downstream crate from defining its own `FormatModel` and registering it with a custom `FormatRegistry`, but in-tree is the recommended path for formats intended to ship in iceberg-rust itself.
+
+The `parquet` crate remains an unconditional dependency of `iceberg` for now, because non-format code (page index evaluators, row group metric evaluators, delete file loaders) still uses `parquet` types directly. A later pass gates the `parquet` dependency on the `format-parquet` feature once those callers move to format-agnostic abstractions.
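
The registry behaviour quoted above (format-keyed lookup, an error for unregistered formats, and a process-wide default behind `OnceLock`) can be sketched with the standard library only. This is an illustration under stated assumptions, not the proposed implementation: the local `DataFileFormat` enum, the `name` method, the `lookup` helper, and `ParquetModel` are hypothetical stand-ins, a plain `String` replaces the crate's `Error`/`ErrorKind::FeatureUnsupported`, and the `#[cfg(feature = ...)]` gates are left out so the block compiles on its own.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Local stand-in for the crate's enum; `Hash` is derived so it can key a HashMap.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum DataFileFormat {
    Avro,
    Orc,
    Parquet,
}

// Reduced stand-in for `FormatModel`; `name` is a hypothetical method for the demo.
trait FormatModel: Send + Sync + 'static {
    fn format(&self) -> DataFileFormat;
    fn name(&self) -> &'static str;
}

struct ParquetModel;

impl FormatModel for ParquetModel {
    fn format(&self) -> DataFileFormat {
        DataFileFormat::Parquet
    }
    fn name(&self) -> &'static str {
        "parquet"
    }
}

#[derive(Default)]
struct FormatRegistry {
    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
}

impl FormatRegistry {
    // Keys each model by the format it reports, so callers never name the concrete type.
    fn register(&mut self, model: Box<dyn FormatModel>) {
        self.models.insert(model.format(), model);
    }

    // A String error stands in for the crate's "feature unsupported" error.
    fn lookup(&self, format: DataFileFormat) -> Result<&dyn FormatModel, String> {
        match self.models.get(&format) {
            Some(model) => Ok(model.as_ref()),
            None => Err(format!("no FormatModel registered for {:?}", format)),
        }
    }
}

// Process-wide registry, built once on first use.
fn default_format_registry() -> &'static FormatRegistry {
    static REGISTRY: OnceLock<FormatRegistry> = OnceLock::new();
    REGISTRY.get_or_init(|| {
        let mut registry = FormatRegistry::default();
        registry.register(Box::new(ParquetModel));
        registry
    })
}

fn main() {
    let registry = default_format_registry();
    match registry.lookup(DataFileFormat::Parquet) {
        Ok(model) => println!("found model: {}", model.name()),
        Err(e) => println!("{e}"),
    }
    if let Err(e) = registry.lookup(DataFileFormat::Orc) {
        println!("{e}");
    }
}
```

Because the registry is an ordinary value, a test can build its own `FormatRegistry` and register only the models it needs, while application code can share the `OnceLock`-backed default.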
+
+### Module layout
+
+```
+crates/iceberg/src/formats/
+├── mod.rs          # FormatModel, FormatReadBuilder, FormatWriteBuilder, FormatFileWriter
+├── registry.rs     # FormatRegistry, default_format_registry
+└── parquet.rs      # ParquetFormatModel, wrapping existing ParquetWriter and ArrowReader

Review Comment:
   nit: This should be formats/parquet/mod.rs?



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements. See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership. The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License. You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied. See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core library of the Apache Iceberg Rust project. Its module layout in `crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, `writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: traits for physical file writers, generic over an output type (defaulting to `Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for logical Iceberg writers (data files, equality deletes, position deletes, partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in `writer/file_writer/parquet_writer.rs` are the only `FileWriter` implementation. Higher-level writers such as `DataFileWriterBuilder` (`writer/base_writer/data_file_writer.rs`) and `EqualityDeleteFileWriterBuilder` (`writer/base_writer/equality_delete_writer.rs`) are generic over any `FileWriterBuilder`, but every example, test, and integration instantiates them with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and `ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses Parquet-specific row group filtering and page index logic. `TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` field, but `process_file_scan_task` ignores it. `DataFileFormat` appears throughout `src/`. Almost every non-test reference is `DataFileFormat::Parquet`. The only non-Parquet references are two uses of `DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: `Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and deletion vectors rather than row data. They are handled in `crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It is the input type for every writer trait and the output type for every reader. The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, `arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and constant column injection, and the Parquet-to-Arrow read path. Every integration crate, including `iceberg-datafusion`, consumes Arrow. The crate does not define a generic `Record` type and does not integrate with engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing `ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options (`with_metadata_size_hint`, `with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, value counts, and min/max bounds through `MinMaxColAggregator` and `NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type.
Another format cannot produce comparable manifest metadata without a shared statistics interface. + +4. **V3 types will need per-format serialization.** The V3 spec adds Variant and Geometry types. Each format encodes them differently: Parquet uses variant shredding, ORC uses binary, Avro uses union types. Implementing either type without a format abstraction means new `match` arms in every reader and writer code path. + +### Prior work + +The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs [#15253](https://github.com/apache/iceberg/pull/15253) through [#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink ([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after. + +Java's `FormatModel` carries two generic parameters: a data type `D` (Java uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and others). A static `FormatModelRegistry` stores implementations keyed by `Pair<FileFormat, Class<?>>` and populates itself through Java reflection at startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and `ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations wrap them rather than replace them. + +PyIceberg has an open proposal for the equivalent capability in issue [apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100), with an in-progress PR at [apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119). Because PyIceberg uses PyArrow as its only in-memory representation, the proposal drops Java's generic type parameters and keys the registry on file format alone. 
Two prior PyIceberg ORC PRs ([#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as stale without merging, which reinforces the case for landing an abstraction layer before adding new formats. + +Design decisions in this RFC that differ from Java (no generics, single-dimension registry key, hard cutover from the old Parquet types) are justified inline where they appear. Specific Java design points that shaped those decisions are called out in Alternatives Considered. + +## Goals + +The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome. + +1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly. + +3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type. + +4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges. + +5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline. + +## Non-Goals + +The items below are deliberately out of scope to keep this proposal focused on the abstraction and its Parquet implementation. Most are follow-up work that the API enables but does not itself deliver. + +1. **Ship new format implementations.** This RFC lands the abstraction and a Parquet implementation. 
ORC, Avro data-file, Vortex, and Lance are follow-up RFCs. + +2. **Introduce a plugin protocol or runtime library loading.** Rust does not offer a clean mechanism for loading compiled plugins at runtime. A runtime-linking approach using `libloading` or similar would expand scope beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, Lance) require. + +3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`. + +4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it. + +5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC. + +6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. It does not modify the Iceberg spec, the manifest format, the manifest list format, or the on-disk layout of any file. + +7. **Modify manifest read or write paths.** Manifests and manifest lists remain in Avro and are handled by the existing `ManifestReader` and `ManifestWriter` paths. The File Format API is about data files and delete files only. + + +## Design + +The Rust API is three traits and a registry. `FormatModel` is the trait that each format implementation provides. `FormatReadBuilder` and `FormatWriteBuilder` are the per-operation configurators that `FormatModel` returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances. None of the traits carry generic parameters. 
The subsections below introduce each type, and a final "Design rationale" subsection explains the choices. + +### The FormatModel trait + +```rust +pub trait FormatModel: Send + Sync + 'static { + fn format(&self) -> DataFileFormat; + fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>; + fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>; +} +``` + +Each implementation registers one instance per `DataFileFormat` variant it supports. The `format` method returns that variant. `read_builder` and `write_builder` are the entry points for reading and writing a file. Both return trait objects so that the registry can hand them back from a `DataFileFormat`-keyed lookup. + +The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to `iceberg::spec::Schema`, and the physical schema type to `arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today that would fill the roles Java uses generic parameters for. Arguments for keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are in "Design rationale" below. + +### The read and write builders + +`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. `FormatModel` produces them, and the caller consumes them with `build`. 
+ +```rust +pub trait FormatReadBuilder: Send { + fn project(&mut self, schema: Schema) -> &mut Self; + fn filter(&mut self, predicate: BoundPredicate) -> &mut Self; + fn split(&mut self, start: u64, length: u64) -> &mut Self; + fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self; + fn batch_size(&mut self, batch_size: usize) -> &mut Self; + fn build(self: Box<Self>) -> BoxFuture<'static, Result<ArrowRecordBatchStream>>; +} + +pub trait FormatWriteBuilder: Send { + fn schema(&mut self, schema: Schema) -> &mut Self; + fn set(&mut self, key: String, value: String) -> &mut Self; + fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn FormatFileWriter>>>; +} +``` + +Both builders take Iceberg `Schema` values. Format implementations convert to physical schemas internally using `schema_to_arrow_schema` from `arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, which has a separate `engineProjection(S)` method for variant shredding and similar per-engine type mapping. Rust's builders expose one projection surface, and the hook for a variant-shredding "engine projection" is a future-extension point described in "Design rationale." + +`FormatWriteBuilder::build` produces a `FormatFileWriter`: + +```rust +pub trait FormatFileWriter: Send { + fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>; + fn close(self: Box<Self>) -> BoxFuture<'static, Result<Vec<DataFileBuilder>>>; +} +``` + +The async methods return `BoxFuture` rather than using `async fn` in traits. The `self: Box<Self>` signature on `build` and `close` lets those methods consume the value while keeping the traits object-safe. Both patterns are forced by the trait-object boundary at the registry. The "BoxFuture instead of async fn in traits" and "Dynamic dispatch at the registry, static inside the format" subsections under "Design rationale" explain why. 
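To make the `BoxFuture` and `self: Box<Self>` patterns concrete, here is a minimal sketch of a type implementing the quoted `FormatFileWriter` trait and being driven through a trait object. Everything except the trait's two method signatures is an assumption for illustration: `RecordBatch` and `DataFileBuilder` are tiny stand-ins for the Arrow and Iceberg types, `BoxFuture` is a local alias with the same shape as `futures::future::BoxFuture`, the error type is a plain `String`, and `CountingWriter`, `block_on`, and `write_two_batches` are hypothetical helpers.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Local alias with the same shape as `futures::future::BoxFuture`.
pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;
/// Simplified result type; the real crate would use `iceberg::Result`.
pub type Result<T> = std::result::Result<T, String>;

/// Stand-ins for Arrow's `RecordBatch` and Iceberg's `DataFileBuilder`.
pub struct RecordBatch { pub num_rows: usize }
#[derive(Debug, PartialEq)]
pub struct DataFileBuilder { pub record_count: usize }

/// Same method signatures as the trait quoted above.
pub trait FormatFileWriter: Send {
    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
    fn close(self: Box<Self>) -> BoxFuture<'static, Result<Vec<DataFileBuilder>>>;
}

/// Hypothetical writer that only counts rows instead of encoding a file.
pub struct CountingWriter { rows: usize }

impl FormatFileWriter for CountingWriter {
    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>> {
        // Copy what is needed first so the future borrows only `self`, not `batch`.
        let rows = batch.num_rows;
        Box::pin(async move {
            self.rows += rows;
            Ok(())
        })
    }

    fn close(self: Box<Self>) -> BoxFuture<'static, Result<Vec<DataFileBuilder>>> {
        // `self` is moved into the future, so the future can be `'static`.
        Box::pin(async move { Ok(vec![DataFileBuilder { record_count: self.rows }]) })
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Busy-polling executor, adequate only for futures that complete without real I/O.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Drives the writer purely through `Box<dyn FormatFileWriter>`.
pub fn write_two_batches() -> Result<Vec<DataFileBuilder>> {
    let mut writer: Box<dyn FormatFileWriter> = Box::new(CountingWriter { rows: 0 });
    block_on(writer.write(&RecordBatch { num_rows: 3 }))?;
    block_on(writer.write(&RecordBatch { num_rows: 4 }))?;
    block_on(writer.close())
}
```

The `'_` on `write` ties the returned future to the `&mut self` borrow, so the writer cannot be used again until that future is finished or dropped, while `close` takes the box by value and so cannot be called twice.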
+ +### The FormatRegistry + +`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances. + +```rust +pub struct FormatRegistry { + models: HashMap<DataFileFormat, Box<dyn FormatModel>>, +} + +impl FormatRegistry { + pub fn new() -> Self { ... } + pub fn register(&mut self, model: Box<dyn FormatModel>) { ... } + pub fn read_builder( + &self, + format: DataFileFormat, + input: InputFile, + ) -> Result<Box<dyn FormatReadBuilder>> { ... } + pub fn write_builder( + &self, + format: DataFileFormat, + output: OutputFile, + ) -> Result<Box<dyn FormatWriteBuilder>> { ... } +} +``` + +Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to the enum. That is a non-breaking addition. + +The registry is an owned value, not a global static. Tests construct their own. Applications construct one at startup and pass it to scan planners and write orchestrators. For the common case of a single registry for the lifetime of a process, `default_format_registry()` returns a `&'static FormatRegistry` initialized through `OnceLock` on first call. + +`read_builder` and `write_builder` return `Err(Error { kind: ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error message distinguishes two cases: the format is implemented but its feature flag is disabled in this build, or the format has no implementation in this crate. + +### Feature flags + +Format implementations live in `iceberg::formats::{format}` and are gated behind a feature flag per format: `format-parquet` on by default, `format-orc` when an ORC implementation lands, and so on. The default feature set includes every format the crate implements. Users who build from source and want a smaller binary disable what they do not need. 
+ +`FormatRegistry::default()` registers every format enabled by the current feature set at compile time: + +```rust +impl Default for FormatRegistry { + fn default() -> Self { + let mut registry = Self::new(); + #[cfg(feature = "format-parquet")] + registry.register(Box::new(ParquetFormatModel::new())); + #[cfg(feature = "format-orc")] + registry.register(Box::new(OrcFormatModel::new())); + registry + } +} +``` + +Format implementations live inside the `iceberg` crate rather than as separate dependencies. This matches how the Java project keeps its `Parquet`, `Avro`, and `ORC` modules inside the `apache/iceberg` repository. In-tree implementations let the community control each format's quality and release cadence and keep the build graph narrow for downstream consumers. Nothing in the trait design prevents a downstream crate from defining its own `FormatModel` and registering it with a custom `FormatRegistry`, but in-tree is the recommended path for formats intended to ship in iceberg-rust itself. + +The `parquet` crate remains an unconditional dependency of `iceberg` for now, because non-format code (page index evaluators, row group metric evaluators, delete file loaders) still uses `parquet` types directly. A later pass gates the `parquet` dependency on the `format-parquet` feature once those callers move to format-agnostic abstractions. + +### Module layout + +``` +crates/iceberg/src/formats/ +├── mod.rs # FormatModel, FormatReadBuilder, FormatWriteBuilder, FormatFileWriter +├── registry.rs # FormatRegistry, default_format_registry +└── parquet.rs # ParquetFormatModel, wrapping existing ParquetWriter and ArrowReader +``` + +Additional formats land as `formats/orc/`, `formats/avro_data/`, and so on, with directory-per-format for anything larger than a few hundred lines. + +In the initial implementation, `ParquetWriter`, `ParquetWriterBuilder`, `ArrowReader`, and `ArrowReaderBuilder` stay at their current module paths. `ParquetFormatModel` wraps them. 
Phase 3 of the Migration Plan moves them into `formats/parquet/` and retires the old paths. + +### Design rationale + +#### No generic over the data type + +Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one row type: Arrow `RecordBatch`. Every writer accepts it, and every reader returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, Lance) produces anything other than `RecordBatch` at the Iceberg-facing boundary. No engine integration in iceberg-rust today brings an engine-native row type the way Spark and Flink do in Java. + +Adding a `D` parameter today means writing `<RecordBatch>` everywhere the trait is used, for no present caller benefit. If a future format cannot bridge to Arrow, the trait can gain an associated type with a default, which is a semver-compatible addition. + +#### No generic over the engine schema + +Java's `S` parameter serves Spark's `StructType`, Flink's `RowType`, and the other engine-native schema types. iceberg-rust has `iceberg::spec::Schema` for the logical schema and `arrow::datatypes::SchemaRef` for the physical schema, with conversion in `arrow/schema.rs`. There is no third schema type a generic parameter would serve. + +Variant shredding is the concrete use case that Java's `engineProjection(S)` method addresses. `FormatReadBuilder` can later gain an `engine_projection(&mut self, schema: ArrowSchemaRef) -> &mut Self` method with a default no-op implementation. Format implementations that support shredding override it. The parameter is `ArrowSchemaRef`, not a generic `S`, because Arrow is the only engine schema in iceberg-rust. + +#### Dynamic dispatch at the registry, static inside the format + +Registry lookup is inherently dynamic: the caller has a `DataFileFormat` value at runtime. The read or write hot loop should use static dispatch and inline normally. 
Splitting the two puts `Box<dyn FormatModel>` at the registry boundary and concrete types everywhere inside the format implementation. Dispatch through the trait object happens once per `read_builder` or `write_builder` call, which is once per file. The hot loop sees concrete types. + +`object_store` uses the same split (`dyn ObjectStore` at the boundary, concrete `AsyncRead` and `AsyncWrite` streams returned). `datafusion` uses the same split (`dyn TableProvider` at the boundary, concrete `RecordBatchStream` returned). + +#### BoxFuture instead of async fn in traits + +`async fn` in traits (stable since Rust 1.75) works for static dispatch. It does not work through `dyn Trait`, because the returned future has an opaque type that trait-object dispatch cannot carry. The `async-trait` crate desugars `async fn` to the same `BoxFuture` a manual implementation produces. Using `BoxFuture` explicitly keeps the future's lifetime visible in the signature and avoids one proc-macro expansion per trait. + +#### Feature flags instead of inventory or libloading + +Two other registration mechanisms were considered. + +The `inventory` crate collects static items at link time through platform-specific linker sections. Each format would register itself with a `submit!` macro. `inventory` drops items silently across static-library boundaries, interacts poorly with analysis tools that do not run a real linker, and registers formats that the caller did not explicitly depend on. + +Runtime library loading through `libloading` or similar would allow formats to ship as separate shared libraries. Rust has no stable ABI, so implementations would need to match the host's compiler version and feature flags exactly. Every use would need `unsafe`. PyIceberg has flagged runtime loading as an open question without landing a mechanism. + +Feature flags work across `cargo check`, `cargo miri`, cross-compilation, and static linking. `datafusion`, `opendal`, and `reqwest` use the same pattern. 
+
+#### RecordBatch as the canonical data type

Review Comment:
   Is this the same as the first point?

##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core library of the Apache Iceberg Rust project. Its module layout in `crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, `writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: traits for physical file writers, generic over an output type (defaulting to `Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. 
**`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for logical Iceberg writers (data files, equality deletes, position deletes, partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output. + +3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in `writer/file_writer/parquet_writer.rs` are the only `FileWriter` implementation. Higher-level writers such as `DataFileWriterBuilder` (`writer/base_writer/data_file_writer.rs`) and `EqualityDeleteFileWriterBuilder` (`writer/base_writer/equality_delete_writer.rs`) are generic over any `FileWriterBuilder`, but every example, test, and integration instantiates them with `ParquetWriterBuilder`. + +The trait layer is format-agnostic. Every concrete instantiation uses Parquet. + +For data file reading, the crate provides `ArrowReaderBuilder` and `ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses Parquet-specific row group filtering and page index logic. `TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` field, but `process_file_scan_task` ignores it. `DataFileFormat` appears throughout `src/`. Almost every non-test reference is `DataFileFormat::Parquet`. The only non-Parquet references are two uses of `DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files. + +The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: `Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and deletion vectors rather than row data. They are handled in `crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today: + +| Format | Data file read | Data file write | Manifests | +|---------|----------------|-----------------|-----------| +| Parquet | Yes | Yes | No | +| Avro | No | No | Yes | +| ORC | No | No | No | + +A table containing ORC or Avro data files cannot be read from Rust today, even though both are valid per the Iceberg spec. + +Arrow `RecordBatch` is the only in-memory data representation in the crate. It is the input type for every writer trait and the output type for every reader. The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, `arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and constant column injection, and the Parquet-to-Arrow read path. Every integration crate, including `iceberg-datafusion`, consumes Arrow. The crate does not define a generic `Record` type and does not integrate with engine-specific row types such as Java's `InternalRow` or `RowData`. + +### Pain points + +1. **No extension point for new formats.** Adding ORC means editing `ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic through every layer that touches it. The write path has the same shape. + +2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options (`with_metadata_size_hint`, `with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory representation (Arrow) with the on-disk format (Parquet). + +3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, value counts, and min/max bounds through `MinMaxColAggregator` and `NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. Another format cannot produce comparable manifest metadata without a shared statistics interface. + +4. **V3 types will need per-format serialization.** The V3 spec adds Variant and Geometry types. 
Each format encodes them differently: Parquet uses variant shredding, ORC uses binary, Avro uses union types. Implementing either type without a format abstraction means new `match` arms in every reader and writer code path. + +### Prior work + +The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs [#15253](https://github.com/apache/iceberg/pull/15253) through [#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink ([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after. + +Java's `FormatModel` carries two generic parameters: a data type `D` (Java uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and others). A static `FormatModelRegistry` stores implementations keyed by `Pair<FileFormat, Class<?>>` and populates itself through Java reflection at startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and `ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations wrap them rather than replace them. + +PyIceberg has an open proposal for the equivalent capability in issue [apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100), with an in-progress PR at [apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119). Because PyIceberg uses PyArrow as its only in-memory representation, the proposal drops Java's generic type parameters and keys the registry on file format alone. 
Two prior PyIceberg ORC PRs ([#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as stale without merging, which reinforces the case for landing an abstraction layer before adding new formats. + +Design decisions in this RFC that differ from Java (no generics, single-dimension registry key, hard cutover from the old Parquet types) are justified inline where they appear. Specific Java design points that shaped those decisions are called out in Alternatives Considered. + +## Goals + +The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome. + +1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly. + +3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type. + +4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges. + +5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline. + +## Non-Goals + +The items below are deliberately out of scope to keep this proposal focused on the abstraction and its Parquet implementation. Most are follow-up work that the API enables but does not itself deliver. + +1. **Ship new format implementations.** This RFC lands the abstraction and a Parquet implementation. 
ORC, Avro data-file, Vortex, and Lance are follow-up RFCs. + +2. **Introduce a plugin protocol or runtime library loading.** Rust does not offer a clean mechanism for loading compiled plugins at runtime. A runtime-linking approach using `libloading` or similar would expand scope beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, Lance) require. + +3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`. + +4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it. + +5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC. + +6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. It does not modify the Iceberg spec, the manifest format, the manifest list format, or the on-disk layout of any file. + +7. **Modify manifest read or write paths.** Manifests and manifest lists remain in Avro and are handled by the existing `ManifestReader` and `ManifestWriter` paths. The File Format API is about data files and delete files only. + + +## Design + +The Rust API is three traits and a registry. `FormatModel` is the trait that each format implementation provides. `FormatReadBuilder` and `FormatWriteBuilder` are the per-operation configurators that `FormatModel` returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances. None of the traits carry generic parameters. 
The subsections below introduce each type, and a final "Design rationale" subsection explains the choices. + +### The FormatModel trait + +```rust +pub trait FormatModel: Send + Sync + 'static { + fn format(&self) -> DataFileFormat; + fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>; + fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>; +} +``` + +Each implementation registers one instance per `DataFileFormat` variant it supports. The `format` method returns that variant. `read_builder` and `write_builder` are the entry points for reading and writing a file. Both return trait objects so that the registry can hand them back from a `DataFileFormat`-keyed lookup. + +The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to `iceberg::spec::Schema`, and the physical schema type to `arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today that would fill the roles Java uses generic parameters for. Arguments for keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are in "Design rationale" below. + +### The read and write builders + +`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. `FormatModel` produces them, and the caller consumes them with `build`. 
+ +```rust +pub trait FormatReadBuilder: Send { + fn project(&mut self, schema: Schema) -> &mut Self; + fn filter(&mut self, predicate: BoundPredicate) -> &mut Self; + fn split(&mut self, start: u64, length: u64) -> &mut Self; + fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self; + fn batch_size(&mut self, batch_size: usize) -> &mut Self; + fn build(self: Box<Self>) -> BoxFuture<'static, Result<ArrowRecordBatchStream>>; +} + +pub trait FormatWriteBuilder: Send { + fn schema(&mut self, schema: Schema) -> &mut Self; + fn set(&mut self, key: String, value: String) -> &mut Self; + fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn FormatFileWriter>>>; +} +``` + +Both builders take Iceberg `Schema` values. Format implementations convert to physical schemas internally using `schema_to_arrow_schema` from `arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, which has a separate `engineProjection(S)` method for variant shredding and similar per-engine type mapping. Rust's builders expose one projection surface, and the hook for a variant-shredding "engine projection" is a future-extension point described in "Design rationale." + +`FormatWriteBuilder::build` produces a `FormatFileWriter`: + +```rust +pub trait FormatFileWriter: Send { + fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>; + fn close(self: Box<Self>) -> BoxFuture<'static, Result<Vec<DataFileBuilder>>>; +} +``` + +The async methods return `BoxFuture` rather than using `async fn` in traits. The `self: Box<Self>` signature on `build` and `close` lets those methods consume the value while keeping the traits object-safe. Both patterns are forced by the trait-object boundary at the registry. The "BoxFuture instead of async fn in traits" and "Dynamic dispatch at the registry, static inside the format" subsections under "Design rationale" explain why. 
+ +### The FormatRegistry + +`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances. + +```rust +pub struct FormatRegistry { + models: HashMap<DataFileFormat, Box<dyn FormatModel>>, +} + +impl FormatRegistry { + pub fn new() -> Self { ... } + pub fn register(&mut self, model: Box<dyn FormatModel>) { ... } + pub fn read_builder( + &self, + format: DataFileFormat, + input: InputFile, + ) -> Result<Box<dyn FormatReadBuilder>> { ... } + pub fn write_builder( + &self, + format: DataFileFormat, + output: OutputFile, + ) -> Result<Box<dyn FormatWriteBuilder>> { ... } +} +``` + +Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to the enum. That is a non-breaking addition. + +The registry is an owned value, not a global static. Tests construct their own. Applications construct one at startup and pass it to scan planners and write orchestrators. For the common case of a single registry for the lifetime of a process, `default_format_registry()` returns a `&'static FormatRegistry` initialized through `OnceLock` on first call. + +`read_builder` and `write_builder` return `Err(Error { kind: ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error message distinguishes two cases: the format is implemented but its feature flag is disabled in this build, or the format has no implementation in this crate. + +### Feature flags + +Format implementations live in `iceberg::formats::{format}` and are gated behind a feature flag per format: `format-parquet` on by default, `format-orc` when an ORC implementation lands, and so on. The default feature set includes every format the crate implements. Users who build from source and want a smaller binary disable what they do not need. 
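The two failure cases that the registry distinguishes for unregistered formats can be illustrated with a minimal, self-contained sketch. Everything here is a simplification for illustration and not the RFC's implementation: the enum and trait are reduced stand-ins for the real types, `lookup` and `feature_flag_for` are hypothetical names, a plain `String` stands in for `Error { kind: ErrorKind::FeatureUnsupported, .. }`, and the sketch assumes (hypothetically) that ORC has an in-tree implementation behind a disabled `format-orc` flag while Avro data files have none.

```rust
use std::collections::HashMap;

// Simplified stand-in for `DataFileFormat`; `Hash` is derived so the enum
// can serve as a `HashMap` key.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum DataFileFormat {
    Avro,
    Orc,
    Parquet,
}

// Reduced to the single method the registry needs for keying.
pub trait FormatModel: Send + Sync + 'static {
    fn format(&self) -> DataFileFormat;
}

pub struct ParquetFormatModel;

impl FormatModel for ParquetFormatModel {
    fn format(&self) -> DataFileFormat {
        DataFileFormat::Parquet
    }
}

#[derive(Default)]
pub struct FormatRegistry {
    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
}

impl FormatRegistry {
    // The model reports its own format, so callers cannot register it
    // under the wrong key.
    pub fn register(&mut self, model: Box<dyn FormatModel>) {
        self.models.insert(model.format(), model);
    }

    // Hypothetical lookup: returns the model, or a message that says
    // whether the format is merely compiled out or not implemented at all.
    pub fn lookup(&self, format: DataFileFormat) -> Result<&dyn FormatModel, String> {
        if let Some(model) = self.models.get(&format) {
            return Ok(&**model);
        }
        Err(match feature_flag_for(format) {
            Some(flag) => format!(
                "{format:?} is implemented, but feature `{flag}` is disabled in this build"
            ),
            None => format!("{format:?} has no implementation in this crate"),
        })
    }
}

// Hypothetical table of formats with an in-tree implementation and the
// feature flag that gates each one.
fn feature_flag_for(format: DataFileFormat) -> Option<&'static str> {
    match format {
        DataFileFormat::Parquet => Some("format-parquet"),
        DataFileFormat::Orc => Some("format-orc"),
        DataFileFormat::Avro => None,
    }
}
```

Keeping the flag table separate from the registered models is what lets the error message tell "compiled out" apart from "not implemented", since a compiled-out format leaves no trace in the `HashMap` itself.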
+ +`FormatRegistry::default()` registers every format enabled by the current feature set at compile time: + +```rust +impl Default for FormatRegistry { + fn default() -> Self { + let mut registry = Self::new(); + #[cfg(feature = "format-parquet")] + registry.register(Box::new(ParquetFormatModel::new())); + #[cfg(feature = "format-orc")] + registry.register(Box::new(OrcFormatModel::new())); + registry + } +} +``` + +Format implementations live inside the `iceberg` crate rather than as separate dependencies. This matches how the Java project keeps its `Parquet`, `Avro`, and `ORC` modules inside the `apache/iceberg` repository. In-tree implementations let the community control each format's quality and release cadence and keep the build graph narrow for downstream consumers. Nothing in the trait design prevents a downstream crate from defining its own `FormatModel` and registering it with a custom `FormatRegistry`, but in-tree is the recommended path for formats intended to ship in iceberg-rust itself. + +The `parquet` crate remains an unconditional dependency of `iceberg` for now, because non-format code (page index evaluators, row group metric evaluators, delete file loaders) still uses `parquet` types directly. A later pass gates the `parquet` dependency on the `format-parquet` feature once those callers move to format-agnostic abstractions. + +### Module layout + +``` +crates/iceberg/src/formats/ +├── mod.rs # FormatModel, FormatReadBuilder, FormatWriteBuilder, FormatFileWriter +├── registry.rs # FormatRegistry, default_format_registry +└── parquet.rs # ParquetFormatModel, wrapping existing ParquetWriter and ArrowReader +``` + +Additional formats land as `formats/orc/`, `formats/avro_data/`, and so on, with directory-per-format for anything larger than a few hundred lines. + +In the initial implementation, `ParquetWriter`, `ParquetWriterBuilder`, `ArrowReader`, and `ArrowReaderBuilder` stay at their current module paths. `ParquetFormatModel` wraps them. 
Phase 3 of the Migration Plan moves them into `formats/parquet/` and retires the old paths. + +### Design rationale + +#### No generic over the data type + +Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one row type: Arrow `RecordBatch`. Every writer accepts it, and every reader returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, Lance) produces anything other than `RecordBatch` at the Iceberg-facing boundary. No engine integration in iceberg-rust today brings an engine-native row type the way Spark and Flink do in Java. + +Adding a `D` parameter today means writing `<RecordBatch>` everywhere the trait is used, for no present caller benefit. If a future format cannot bridge to Arrow, the trait can gain an associated type with a default, which is a semver-compatible addition.

Review Comment:
> the trait can gain an associated type with a default

I think we can do it in the initial cut, but I have not investigated the required effort for this

########## docs/rfcs/0003_file_format_api.md: ##########
@@ -0,0 +1,520 @@
+The registry is an owned value, not a global static. Tests construct their own. Applications construct one at startup and pass it to scan planners and write orchestrators. For the common case of a single registry for the lifetime of a process, `default_format_registry()` returns a `&'static FormatRegistry` initialized through `OnceLock` on first call.

Review Comment:
Would love to see more details on scan planner's API changes. Also what about writing?

########## docs/rfcs/0003_file_format_api.md: ########## @@ -0,0 +1,520 @@ +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License.
You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. +--> + +# RFC: File Format API for Apache Iceberg Rust + +## Background + +### Current state + +The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core library of the Apache Iceberg Rust project. Its module layout in `crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, `writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today. + +For data file writing, the crate provides a three-layer architecture described in `crates/iceberg/src/writer/mod.rs`: + +1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: traits for physical file writers, generic over an output type (defaulting to `Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`. + +2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for logical Iceberg writers (data files, equality deletes, position deletes, partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output. + +3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in `writer/file_writer/parquet_writer.rs` are the only `FileWriter` implementation. 
Higher-level writers such as `DataFileWriterBuilder` (`writer/base_writer/data_file_writer.rs`) and `EqualityDeleteFileWriterBuilder` (`writer/base_writer/equality_delete_writer.rs`) are generic over any `FileWriterBuilder`, but every example, test, and integration instantiates them with `ParquetWriterBuilder`. + +The trait layer is format-agnostic. Every concrete instantiation uses Parquet. + +For data file reading, the crate provides `ArrowReaderBuilder` and `ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses Parquet-specific row group filtering and page index logic. `TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` field, but `process_file_scan_task` ignores it. `DataFileFormat` appears throughout `src/`. Almost every non-test reference is `DataFileFormat::Parquet`. The only non-Parquet references are two uses of `DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files. + +The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: `Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and deletion vectors rather than row data. They are handled in `crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). Data file support today: + +| Format | Data file read | Data file write | Manifests | +|---------|----------------|-----------------|-----------| +| Parquet | Yes | Yes | No | +| Avro | No | No | Yes | +| ORC | No | No | No | + +A table containing ORC or Avro data files cannot be read from Rust today, even though both are valid per the Iceberg spec. + +Arrow `RecordBatch` is the only in-memory data representation in the crate. It is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, `arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and constant column injection, and the Parquet-to-Arrow read path. Every integration crate, including `iceberg-datafusion`, consumes Arrow. The crate does not define a generic `Record` type and does not integrate with engine-specific row types such as Java's `InternalRow` or `RowData`. + +### Pain points + +1. **No extension point for new formats.** Adding ORC means editing `ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic through every layer that touches it. The write path has the same shape. + +2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options (`with_metadata_size_hint`, `with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory representation (Arrow) with the on-disk format (Parquet). + +3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, value counts, and min/max bounds through `MinMaxColAggregator` and `NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. Another format cannot produce comparable manifest metadata without a shared statistics interface. + +4. **V3 types will need per-format serialization.** The V3 spec adds Variant and Geometry types. Each format encodes them differently: Parquet uses variant shredding, ORC uses binary, Avro uses union types. Implementing either type without a format abstraction means new `match` arms in every reader and writer code path. + +### Prior work + +The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month review. 
Implementations for Parquet, Avro, ORC, and Arrow followed in PRs [#15253](https://github.com/apache/iceberg/pull/15253) through [#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink ([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after. + +Java's `FormatModel` carries two generic parameters: a data type `D` (Java uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and others). A static `FormatModelRegistry` stores implementations keyed by `Pair<FileFormat, Class<?>>` and populates itself through Java reflection at startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and `ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations wrap them rather than replace them. + +PyIceberg has an open proposal for the equivalent capability in issue [apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100), with an in-progress PR at [apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119). Because PyIceberg uses PyArrow as its only in-memory representation, the proposal drops Java's generic type parameters and keys the registry on file format alone. Two prior PyIceberg ORC PRs ([#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as stale without merging, which reinforces the case for landing an abstraction layer before adding new formats. + +Design decisions in this RFC that differ from Java (no generics, single-dimension registry key, hard cutover from the old Parquet types) are justified inline where they appear. Specific Java design points that shaped those decisions are called out in Alternatives Considered. 
+ +## Goals + +The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome. + +1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly. + +3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type. + +4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges. + +5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline. + +## Non-Goals + +The items below are deliberately out of scope to keep this proposal focused on the abstraction and its Parquet implementation. Most are follow-up work that the API enables but does not itself deliver. + +1. **Ship new format implementations.** This RFC lands the abstraction and a Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up RFCs. + +2. **Introduce a plugin protocol or runtime library loading.** Rust does not offer a clean mechanism for loading compiled plugins at runtime. A runtime-linking approach using `libloading` or similar would expand scope beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, Lance) require. + +3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. 
They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`. + +4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it. + +5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC. + +6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. It does not modify the Iceberg spec, the manifest format, the manifest list format, or the on-disk layout of any file. + +7. **Modify manifest read or write paths.** Manifests and manifest lists remain in Avro and are handled by the existing `ManifestReader` and `ManifestWriter` paths. The File Format API is about data files and delete files only. + + +## Design + +The Rust API is three traits and a registry. `FormatModel` is the trait that each format implementation provides. `FormatReadBuilder` and `FormatWriteBuilder` are the per-operation configurators that `FormatModel` returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances. None of the traits carry generic parameters. The subsections below introduce each type, and a final "Design rationale" subsection explains the choices. + +### The FormatModel trait + +```rust +pub trait FormatModel: Send + Sync + 'static { + fn format(&self) -> DataFileFormat; + fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>; + fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>; +} +``` + +Each implementation registers one instance per `DataFileFormat` variant it supports. The `format` method returns that variant. 
`read_builder` and `write_builder` are the entry points for reading and writing a file. Both return trait objects so that the registry can hand them back from a `DataFileFormat`-keyed lookup. + +The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to `iceberg::spec::Schema`, and the physical schema type to `arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today that would fill the roles Java uses generic parameters for. Arguments for keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are in "Design rationale" below. + +### The read and write builders + +`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. `FormatModel` produces them, and the caller consumes them with `build`. + +```rust +pub trait FormatReadBuilder: Send { + fn project(&mut self, schema: Schema) -> &mut Self; + fn filter(&mut self, predicate: BoundPredicate) -> &mut Self; + fn split(&mut self, start: u64, length: u64) -> &mut Self; + fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self; + fn batch_size(&mut self, batch_size: usize) -> &mut Self; + fn build(self: Box<Self>) -> BoxFuture<'static, Result<ArrowRecordBatchStream>>; +} + +pub trait FormatWriteBuilder: Send { + fn schema(&mut self, schema: Schema) -> &mut Self; + fn set(&mut self, key: String, value: String) -> &mut Self; + fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn FormatFileWriter>>>; +} +``` + +Both builders take Iceberg `Schema` values. Format implementations convert to physical schemas internally using `schema_to_arrow_schema` from `arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, which has a separate `engineProjection(S)` method for variant shredding and similar per-engine type mapping. Rust's builders expose one projection surface, and the hook for a variant-shredding "engine projection" is a future-extension point described in "Design rationale." 
+ +`FormatWriteBuilder::build` produces a `FormatFileWriter`: + +```rust +pub trait FormatFileWriter: Send { + fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>; + fn close(self: Box<Self>) -> BoxFuture<'static, Result<Vec<DataFileBuilder>>>; +} +``` + +The async methods return `BoxFuture` rather than using `async fn` in traits. The `self: Box<Self>` signature on `build` and `close` lets those methods consume the value while keeping the traits object-safe. Both patterns are forced by the trait-object boundary at the registry. The "BoxFuture instead of async fn in traits" and "Dynamic dispatch at the registry, static inside the format" subsections under "Design rationale" explain why. + +### The FormatRegistry + +`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances. + +```rust +pub struct FormatRegistry { + models: HashMap<DataFileFormat, Box<dyn FormatModel>>, +} + +impl FormatRegistry { + pub fn new() -> Self { ... } + pub fn register(&mut self, model: Box<dyn FormatModel>) { ... } + pub fn read_builder( + &self, + format: DataFileFormat, + input: InputFile, + ) -> Result<Box<dyn FormatReadBuilder>> { ... } + pub fn write_builder( + &self, + format: DataFileFormat, + output: OutputFile, + ) -> Result<Box<dyn FormatWriteBuilder>> { ... } +} +``` + +Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to the enum. That is a non-breaking addition. + +The registry is an owned value, not a global static. Tests construct their own. Applications construct one at startup and pass it to scan planners and write orchestrators. For the common case of a single registry for the lifetime of a process, `default_format_registry()` returns a `&'static FormatRegistry` initialized through `OnceLock` on first call. Review Comment: Or we can inject format registry to Catalog instances directly? -- This is an automated message from the Apache Git Service. 
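One way to read the review comment's suggestion is that a catalog would own a shared handle to the registry and hand it to whatever it loads, instead of callers reaching for `default_format_registry()`. The sketch below is only an illustration of that shape under stated assumptions. `DataFileFormat`, `FormatModel`, and `FormatRegistry` are reduced stand-ins for the types in the RFC (builders omitted), and `ParquetModel`, `MemoryCatalog`, `with_format_registry`, `format_registry`, `supports`, and `parquet_only_registry` are hypothetical names that appear in neither the RFC nor the crate.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for iceberg's `DataFileFormat`; `Hash` is derived so it can key a map.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum DataFileFormat {
    Avro,
    Orc,
    Parquet,
}

// Reduced stand-in for the RFC's `FormatModel`; read/write builders are omitted.
pub trait FormatModel: Send + Sync + 'static {
    fn format(&self) -> DataFileFormat;
}

// Reduced stand-in for the RFC's `FormatRegistry`.
#[derive(Default)]
pub struct FormatRegistry {
    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
}

impl FormatRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    // Registers `model` under the format it reports, replacing any earlier entry.
    pub fn register(&mut self, model: Box<dyn FormatModel>) {
        self.models.insert(model.format(), model);
    }

    // Hypothetical helper: reports whether a model is registered for `format`.
    pub fn supports(&self, format: DataFileFormat) -> bool {
        self.models.contains_key(&format)
    }
}

// Hypothetical Parquet model used only to populate the registry.
pub struct ParquetModel;

impl FormatModel for ParquetModel {
    fn format(&self) -> DataFileFormat {
        DataFileFormat::Parquet
    }
}

// Hypothetical catalog that receives its registry at construction time.
pub struct MemoryCatalog {
    formats: Arc<FormatRegistry>,
}

impl MemoryCatalog {
    pub fn with_format_registry(formats: Arc<FormatRegistry>) -> Self {
        Self { formats }
    }

    // Tables or scans created by this catalog could clone this handle
    // rather than consulting a process-wide default.
    pub fn format_registry(&self) -> Arc<FormatRegistry> {
        Arc::clone(&self.formats)
    }
}

// Builds a registry that only knows about Parquet and wraps it for sharing.
pub fn parquet_only_registry() -> Arc<FormatRegistry> {
    let mut registry = FormatRegistry::new();
    registry.register(Box::new(ParquetModel));
    Arc::new(registry)
}
```

With this shape, two catalogs in the same process could carry different registries, which a single `&'static` default cannot express. The cost is that every catalog constructor gains a parameter, or a default, for the registry.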
