Re: [PR] rfc: Implement an API for all Data File Formats [iceberg-rust]

via GitHub Wed, 29 Apr 2026 16:31:02 -0700


CTTY commented on code in PR #2384:
URL: https://github.com/apache/iceberg-rust/pull/2384#discussion_r3164586355



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.

Review Comment:
   I think this is more of an implementation details



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.

Review Comment:
   Is this really necessary for the initial implementation?  we only support 
parquet for now
   
   TCK is not even completed on the java side afaik



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.
+
+6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. 
It does not modify the Iceberg spec, the manifest format, the manifest list 
format, or the on-disk layout of any file.
+
+7. **Modify manifest read or write paths.** Manifests and manifest lists 
remain in Avro and are handled by the existing `ManifestReader` and 
`ManifestWriter` paths. The File Format API is about data files and delete 
files only.
+
+
+## Design
+
+The Rust API is three traits and a registry. `FormatModel` is the trait that 
each format implementation provides. `FormatReadBuilder` and 
`FormatWriteBuilder` are the per-operation configurators that `FormatModel` 
returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` 
instances. None of the traits carry generic parameters. The subsections below 
introduce each type, and a final "Design rationale" subsection explains the 
choices.
+
+### The FormatModel trait
+
+```rust
+pub trait FormatModel: Send + Sync + 'static {
+    fn format(&self) -> DataFileFormat;
+    fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>;
+    fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>;
+}
+```
+
+Each implementation registers one instance per `DataFileFormat` variant it 
supports. The `format` method returns that variant. `read_builder` and 
`write_builder` are the entry points for reading and writing a file. Both 
return trait objects so that the registry can hand them back from a 
`DataFileFormat`-keyed lookup.
+
+The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to 
`iceberg::spec::Schema`, and the physical schema type to 
`arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today 
that would fill the roles Java uses generic parameters for. Arguments for 
keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are 
in "Design rationale" below.

Review Comment:
   We don't have to hardcode it to use Arrow's `RecordBatch` even. We can use a 
generic type for in-memory representation and arrow can be the default value



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.
+
+6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. 
It does not modify the Iceberg spec, the manifest format, the manifest list 
format, or the on-disk layout of any file.
+
+7. **Modify manifest read or write paths.** Manifests and manifest lists 
remain in Avro and are handled by the existing `ManifestReader` and 
`ManifestWriter` paths. The File Format API is about data files and delete 
files only.
+
+
+## Design
+
+The Rust API is three traits and a registry. `FormatModel` is the trait that 
each format implementation provides. `FormatReadBuilder` and 
`FormatWriteBuilder` are the per-operation configurators that `FormatModel` 
returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` 
instances. None of the traits carry generic parameters. The subsections below 
introduce each type, and a final "Design rationale" subsection explains the 
choices.
+
+### The FormatModel trait
+
+```rust
+pub trait FormatModel: Send + Sync + 'static {
+    fn format(&self) -> DataFileFormat;
+    fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>;
+    fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>;
+}
+```
+
+Each implementation registers one instance per `DataFileFormat` variant it 
supports. The `format` method returns that variant. `read_builder` and 
`write_builder` are the entry points for reading and writing a file. Both 
return trait objects so that the registry can hand them back from a 
`DataFileFormat`-keyed lookup.
+
+The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to 
`iceberg::spec::Schema`, and the physical schema type to 
`arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today 
that would fill the roles Java uses generic parameters for. Arguments for 
keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are 
in "Design rationale" below.
+
+### The read and write builders
+
+`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. 
`FormatModel` produces them, and the caller consumes them with `build`.
+
+```rust
+pub trait FormatReadBuilder: Send {
+    fn project(&mut self, schema: Schema) -> &mut Self;
+    fn filter(&mut self, predicate: BoundPredicate) -> &mut Self;
+    fn split(&mut self, start: u64, length: u64) -> &mut Self;
+    fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self;
+    fn batch_size(&mut self, batch_size: usize) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, 
Result<ArrowRecordBatchStream>>;
+}
+
+pub trait FormatWriteBuilder: Send {
+    fn schema(&mut self, schema: Schema) -> &mut Self;
+    fn set(&mut self, key: String, value: String) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn 
FormatFileWriter>>>;
+}
+```
+
+Both builders take Iceberg `Schema` values. Format implementations convert to 
physical schemas internally using `schema_to_arrow_schema` from 
`arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, 
which has a separate `engineProjection(S)` method for variant shredding and 
similar per-engine type mapping. Rust's builders expose one projection surface, 
and the hook for a variant-shredding "engine projection" is a future-extension 
point described in "Design rationale."
+
+`FormatWriteBuilder::build` produces a `FormatFileWriter`:
+
+```rust
+pub trait FormatFileWriter: Send {
+    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
+    fn close(self: Box<Self>) -> BoxFuture<'static, 
Result<Vec<DataFileBuilder>>>;
+}
+```
+
+The async methods return `BoxFuture` rather than using `async fn` in traits. 
The `self: Box<Self>` signature on `build` and `close` lets those methods 
consume the value while keeping the traits object-safe. Both patterns are 
forced by the trait-object boundary at the registry. The "BoxFuture instead of 
async fn in traits" and "Dynamic dispatch at the registry, static inside the 
format" subsections under "Design rationale" explain why.
+
+### The FormatRegistry
+
+`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances.
+
+```rust
+pub struct FormatRegistry {
+    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
+}
+
+impl FormatRegistry {
+    pub fn new() -> Self { ... }
+    pub fn register(&mut self, model: Box<dyn FormatModel>) { ... }
+    pub fn read_builder(
+        &self,
+        format: DataFileFormat,
+        input: InputFile,
+    ) -> Result<Box<dyn FormatReadBuilder>> { ... }
+    pub fn write_builder(
+        &self,
+        format: DataFileFormat,
+        output: OutputFile,
+    ) -> Result<Box<dyn FormatWriteBuilder>> { ... }
+}
+```
+
+Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to 
the enum. That is a non-breaking addition.
+
+The registry is an owned value, not a global static. Tests construct their 
own. Applications construct one at startup and pass it to scan planners and 
write orchestrators. For the common case of a single registry for the lifetime 
of a process, `default_format_registry()` returns a `&'static FormatRegistry` 
initialized through `OnceLock` on first call.
+
+`read_builder` and `write_builder` return `Err(Error { kind: 
ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error 
message distinguishes two cases: the format is implemented but its feature flag 
is disabled in this build, or the format has no implementation in this crate.
+
+### Feature flags

Review Comment:
   This could be a non-goal, we only plan to support parquet for now and 
parquet is essential



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.
+
+6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. 
It does not modify the Iceberg spec, the manifest format, the manifest list 
format, or the on-disk layout of any file.
+
+7. **Modify manifest read or write paths.** Manifests and manifest lists 
remain in Avro and are handled by the existing `ManifestReader` and 
`ManifestWriter` paths. The File Format API is about data files and delete 
files only.
+
+
+## Design
+
+The Rust API is three traits and a registry. `FormatModel` is the trait that 
each format implementation provides. `FormatReadBuilder` and 
`FormatWriteBuilder` are the per-operation configurators that `FormatModel` 
returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` 
instances. None of the traits carry generic parameters. The subsections below 
introduce each type, and a final "Design rationale" subsection explains the 
choices.
+
+### The FormatModel trait
+
+```rust
+pub trait FormatModel: Send + Sync + 'static {
+    fn format(&self) -> DataFileFormat;
+    fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>;
+    fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>;
+}
+```
+
+Each implementation registers one instance per `DataFileFormat` variant it 
supports. The `format` method returns that variant. `read_builder` and 
`write_builder` are the entry points for reading and writing a file. Both 
return trait objects so that the registry can hand them back from a 
`DataFileFormat`-keyed lookup.
+
+The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to 
`iceberg::spec::Schema`, and the physical schema type to 
`arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today 
that would fill the roles Java uses generic parameters for. Arguments for 
keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are 
in "Design rationale" below.
+
+### The read and write builders
+
+`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. 
`FormatModel` produces them, and the caller consumes them with `build`.
+
+```rust
+pub trait FormatReadBuilder: Send {
+    fn project(&mut self, schema: Schema) -> &mut Self;
+    fn filter(&mut self, predicate: BoundPredicate) -> &mut Self;
+    fn split(&mut self, start: u64, length: u64) -> &mut Self;
+    fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self;
+    fn batch_size(&mut self, batch_size: usize) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, 
Result<ArrowRecordBatchStream>>;
+}
+
+pub trait FormatWriteBuilder: Send {
+    fn schema(&mut self, schema: Schema) -> &mut Self;
+    fn set(&mut self, key: String, value: String) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn 
FormatFileWriter>>>;
+}
+```
+
+Both builders take Iceberg `Schema` values. Format implementations convert to 
physical schemas internally using `schema_to_arrow_schema` from 
`arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, 
which has a separate `engineProjection(S)` method for variant shredding and 
similar per-engine type mapping. Rust's builders expose one projection surface, 
and the hook for a variant-shredding "engine projection" is a future-extension 
point described in "Design rationale."
+
+`FormatWriteBuilder::build` produces a `FormatFileWriter`:
+
+```rust
+pub trait FormatFileWriter: Send {
+    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
+    fn close(self: Box<Self>) -> BoxFuture<'static, 
Result<Vec<DataFileBuilder>>>;
+}
+```
+
+The async methods return `BoxFuture` rather than using `async fn` in traits. 
The `self: Box<Self>` signature on `build` and `close` lets those methods 
consume the value while keeping the traits object-safe. Both patterns are 
forced by the trait-object boundary at the registry. The "BoxFuture instead of 
async fn in traits" and "Dynamic dispatch at the registry, static inside the 
format" subsections under "Design rationale" explain why.
+
+### The FormatRegistry
+
+`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances.
+
+```rust
+pub struct FormatRegistry {
+    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
+}
+
+impl FormatRegistry {
+    pub fn new() -> Self { ... }
+    pub fn register(&mut self, model: Box<dyn FormatModel>) { ... }
+    pub fn read_builder(
+        &self,
+        format: DataFileFormat,
+        input: InputFile,
+    ) -> Result<Box<dyn FormatReadBuilder>> { ... }
+    pub fn write_builder(
+        &self,
+        format: DataFileFormat,
+        output: OutputFile,
+    ) -> Result<Box<dyn FormatWriteBuilder>> { ... }
+}
+```
+
+Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to 
the enum. That is a non-breaking addition.
+
+The registry is an owned value, not a global static. Tests construct their 
own. Applications construct one at startup and pass it to scan planners and 
write orchestrators. For the common case of a single registry for the lifetime 
of a process, `default_format_registry()` returns a `&'static FormatRegistry` 
initialized through `OnceLock` on first call.
+
+`read_builder` and `write_builder` return `Err(Error { kind: 
ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error 
message distinguishes two cases: the format is implemented but its feature flag 
is disabled in this build, or the format has no implementation in this crate.
+
+### Feature flags
+
+Format implementations live in `iceberg::formats::{format}` and are gated 
behind a feature flag per format: `format-parquet` on by default, `format-orc` 
when an ORC implementation lands, and so on. The default feature set includes 
every format the crate implements. Users who build from source and want a 
smaller binary disable what they do not need.
+
+`FormatRegistry::default()` registers every format enabled by the current 
feature set at compile time:
+
+```rust
+impl Default for FormatRegistry {
+    fn default() -> Self {
+        let mut registry = Self::new();
+        #[cfg(feature = "format-parquet")]
+        registry.register(Box::new(ParquetFormatModel::new()));
+        #[cfg(feature = "format-orc")]
+        registry.register(Box::new(OrcFormatModel::new()));
+        registry
+    }
+}
+```
+
+Format implementations live inside the `iceberg` crate rather than as separate 
dependencies. This matches how the Java project keeps its `Parquet`, `Avro`, 
and `ORC` modules inside the `apache/iceberg` repository. In-tree 
implementations let the community control each format's quality and release 
cadence and keep the build graph narrow for downstream consumers. Nothing in 
the trait design prevents a downstream crate from defining its own 
`FormatModel` and registering it with a custom `FormatRegistry`, but in-tree is 
the recommended path for formats intended to ship in iceberg-rust itself.
+
+The `parquet` crate remains an unconditional dependency of `iceberg` for now, 
because non-format code (page index evaluators, row group metric evaluators, 
delete file loaders) still uses `parquet` types directly. A later pass gates 
the `parquet` dependency on the `format-parquet` feature once those callers 
move to format-agnostic abstractions.
+
+### Module layout
+
+```
+crates/iceberg/src/formats/
+├── mod.rs        # FormatModel, FormatReadBuilder, FormatWriteBuilder, 
FormatFileWriter
+├── registry.rs   # FormatRegistry, default_format_registry
+└── parquet.rs    # ParquetFormatModel, wrapping existing ParquetWriter and 
ArrowReader

Review Comment:
   nit: This should be formats/parquet/mod.rs? 



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.
+
+6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. 
It does not modify the Iceberg spec, the manifest format, the manifest list 
format, or the on-disk layout of any file.
+
+7. **Modify manifest read or write paths.** Manifests and manifest lists 
remain in Avro and are handled by the existing `ManifestReader` and 
`ManifestWriter` paths. The File Format API is about data files and delete 
files only.
+
+
+## Design
+
+The Rust API is three traits and a registry. `FormatModel` is the trait that 
each format implementation provides. `FormatReadBuilder` and 
`FormatWriteBuilder` are the per-operation configurators that `FormatModel` 
returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` 
instances. None of the traits carry generic parameters. The subsections below 
introduce each type, and a final "Design rationale" subsection explains the 
choices.
+
+### The FormatModel trait
+
+```rust
+pub trait FormatModel: Send + Sync + 'static {
+    fn format(&self) -> DataFileFormat;
+    fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>;
+    fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>;
+}
+```
+
+Each implementation registers one instance per `DataFileFormat` variant it 
supports. The `format` method returns that variant. `read_builder` and 
`write_builder` are the entry points for reading and writing a file. Both 
return trait objects so that the registry can hand them back from a 
`DataFileFormat`-keyed lookup.
+
+The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to 
`iceberg::spec::Schema`, and the physical schema type to 
`arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today 
that would fill the roles Java uses generic parameters for. Arguments for 
keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are 
in "Design rationale" below.
+
+### The read and write builders
+
+`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. 
`FormatModel` produces them, and the caller consumes them with `build`.
+
+```rust
+pub trait FormatReadBuilder: Send {
+    fn project(&mut self, schema: Schema) -> &mut Self;
+    fn filter(&mut self, predicate: BoundPredicate) -> &mut Self;
+    fn split(&mut self, start: u64, length: u64) -> &mut Self;
+    fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self;
+    fn batch_size(&mut self, batch_size: usize) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, 
Result<ArrowRecordBatchStream>>;
+}
+
+pub trait FormatWriteBuilder: Send {
+    fn schema(&mut self, schema: Schema) -> &mut Self;
+    fn set(&mut self, key: String, value: String) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn 
FormatFileWriter>>>;
+}
+```
+
+Both builders take Iceberg `Schema` values. Format implementations convert to 
physical schemas internally using `schema_to_arrow_schema` from 
`arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, 
which has a separate `engineProjection(S)` method for variant shredding and 
similar per-engine type mapping. Rust's builders expose one projection surface, 
and the hook for a variant-shredding "engine projection" is a future-extension 
point described in "Design rationale."
+
+`FormatWriteBuilder::build` produces a `FormatFileWriter`:
+
+```rust
+pub trait FormatFileWriter: Send {
+    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
+    fn close(self: Box<Self>) -> BoxFuture<'static, 
Result<Vec<DataFileBuilder>>>;
+}
+```
+
+The async methods return `BoxFuture` rather than using `async fn` in traits. 
The `self: Box<Self>` signature on `build` and `close` lets those methods 
consume the value while keeping the traits object-safe. Both patterns are 
forced by the trait-object boundary at the registry. The "BoxFuture instead of 
async fn in traits" and "Dynamic dispatch at the registry, static inside the 
format" subsections under "Design rationale" explain why.
+
+### The FormatRegistry
+
+`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances.
+
+```rust
+pub struct FormatRegistry {
+    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
+}
+
+impl FormatRegistry {
+    pub fn new() -> Self { ... }
+    pub fn register(&mut self, model: Box<dyn FormatModel>) { ... }
+    pub fn read_builder(
+        &self,
+        format: DataFileFormat,
+        input: InputFile,
+    ) -> Result<Box<dyn FormatReadBuilder>> { ... }
+    pub fn write_builder(
+        &self,
+        format: DataFileFormat,
+        output: OutputFile,
+    ) -> Result<Box<dyn FormatWriteBuilder>> { ... }
+}
+```
+
+Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to 
the enum. That is a non-breaking addition.
+
+The registry is an owned value, not a global static. Tests construct their 
own. Applications construct one at startup and pass it to scan planners and 
write orchestrators. For the common case of a single registry for the lifetime 
of a process, `default_format_registry()` returns a `&'static FormatRegistry` 
initialized through `OnceLock` on first call.
+
+`read_builder` and `write_builder` return `Err(Error { kind: 
ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error 
message distinguishes two cases: the format is implemented but its feature flag 
is disabled in this build, or the format has no implementation in this crate.
+
+### Feature flags
+
+Format implementations live in `iceberg::formats::{format}` and are gated 
behind a feature flag per format: `format-parquet` on by default, `format-orc` 
when an ORC implementation lands, and so on. The default feature set includes 
every format the crate implements. Users who build from source and want a 
smaller binary disable what they do not need.
+
+`FormatRegistry::default()` registers every format enabled by the current 
feature set at compile time:
+
+```rust
+impl Default for FormatRegistry {
+    fn default() -> Self {
+        let mut registry = Self::new();
+        #[cfg(feature = "format-parquet")]
+        registry.register(Box::new(ParquetFormatModel::new()));
+        #[cfg(feature = "format-orc")]
+        registry.register(Box::new(OrcFormatModel::new()));
+        registry
+    }
+}
+```
+
+Format implementations live inside the `iceberg` crate rather than as separate 
dependencies. This matches how the Java project keeps its `Parquet`, `Avro`, 
and `ORC` modules inside the `apache/iceberg` repository. In-tree 
implementations let the community control each format's quality and release 
cadence and keep the build graph narrow for downstream consumers. Nothing in 
the trait design prevents a downstream crate from defining its own 
`FormatModel` and registering it with a custom `FormatRegistry`, but in-tree is 
the recommended path for formats intended to ship in iceberg-rust itself.
+
+The `parquet` crate remains an unconditional dependency of `iceberg` for now, 
because non-format code (page index evaluators, row group metric evaluators, 
delete file loaders) still uses `parquet` types directly. A later pass gates 
the `parquet` dependency on the `format-parquet` feature once those callers 
move to format-agnostic abstractions.
+
+### Module layout
+
+```
+crates/iceberg/src/formats/
+├── mod.rs        # FormatModel, FormatReadBuilder, FormatWriteBuilder, 
FormatFileWriter
+├── registry.rs   # FormatRegistry, default_format_registry
+└── parquet.rs    # ParquetFormatModel, wrapping existing ParquetWriter and 
ArrowReader
+```
+
+Additional formats land as `formats/orc/`, `formats/avro_data/`, and so on, 
with directory-per-format for anything larger than a few hundred lines.
+
+In the initial implementation, `ParquetWriter`, `ParquetWriterBuilder`, 
`ArrowReader`, and `ArrowReaderBuilder` stay at their current module paths. 
`ParquetFormatModel` wraps them. Phase 3 of the Migration Plan moves them into 
`formats/parquet/` and retires the old paths.
+
+### Design rationale
+
+#### No generic over the data type
+
+Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, 
Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one 
row type: Arrow `RecordBatch`. Every writer accepts it, and every reader 
returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, 
Lance) produces anything other than `RecordBatch` at the Iceberg-facing 
boundary. No engine integration in iceberg-rust today brings an engine-native 
row type the way Spark and Flink do in Java.
+
+Adding a `D` parameter today means writing `<RecordBatch>` everywhere the 
trait is used, for no present caller benefit. If a future format cannot bridge 
to Arrow, the trait can gain an associated type with a default, which is a 
semver-compatible addition.
+
+#### No generic over the engine schema
+
+Java's `S` parameter serves Spark's `StructType`, Flink's `RowType`, and the 
other engine-native schema types. iceberg-rust has `iceberg::spec::Schema` for 
the logical schema and `arrow::datatypes::SchemaRef` for the physical schema, 
with conversion in `arrow/schema.rs`. There is no third schema type a generic 
parameter would serve.
+
+Variant shredding is the concrete use case that Java's `engineProjection(S)` 
method addresses. `FormatReadBuilder` can later gain an `engine_projection(&mut 
self, schema: ArrowSchemaRef) -> &mut Self` method with a default no-op 
implementation. Format implementations that support shredding override it. The 
parameter is `ArrowSchemaRef`, not a generic `S`, because Arrow is the only 
engine schema in iceberg-rust.
+
+#### Dynamic dispatch at the registry, static inside the format
+
+Registry lookup is inherently dynamic: the caller has a `DataFileFormat` value 
at runtime. The read or write hot loop should use static dispatch and inline 
normally. Splitting the two puts `Box<dyn FormatModel>` at the registry 
boundary and concrete types everywhere inside the format implementation. 
Dispatch through the trait object happens once per `read_builder` or 
`write_builder` call, which is once per file. The hot loop sees concrete types.
+
+`object_store` uses the same split (`dyn ObjectStore` at the boundary, 
concrete `AsyncRead` and `AsyncWrite` streams returned). `datafusion` uses the 
same split (`dyn TableProvider` at the boundary, concrete `RecordBatchStream` 
returned).
+
+#### BoxFuture instead of async fn in traits
+
+`async fn` in traits (stable since Rust 1.75) works for static dispatch. It 
does not work through `dyn Trait`, because the returned future has an opaque 
type that trait-object dispatch cannot carry. The `async-trait` crate desugars 
`async fn` to the same `BoxFuture` a manual implementation produces. Using 
`BoxFuture` explicitly keeps the future's lifetime visible in the signature and 
avoids one proc-macro expansion per trait.
+
+#### Feature flags instead of inventory or libloading
+
+Two other registration mechanisms were considered.
+
+The `inventory` crate collects static items at link time through 
platform-specific linker sections. Each format would register itself with a 
`submit!` macro. `inventory` drops items silently across static-library 
boundaries, interacts poorly with analysis tools that do not run a real linker, 
and registers formats that the caller did not explicitly depend on.
+
+Runtime library loading through `libloading` or similar would allow formats to 
ship as separate shared libraries. Rust has no stable ABI, so implementations 
would need to match the host's compiler version and feature flags exactly. 
Every use would need `unsafe`. PyIceberg has flagged runtime loading as an open 
question without landing a mechanism.
+
+Feature flags work across `cargo check`, `cargo miri`, cross-compilation, and 
static linking. `datafusion`, `opendal`, and `reqwest` use the same pattern.
+
+#### RecordBatch as the canonical data type

Review Comment:
   Is this the same as the first point?



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.
+
+6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. 
It does not modify the Iceberg spec, the manifest format, the manifest list 
format, or the on-disk layout of any file.
+
+7. **Modify manifest read or write paths.** Manifests and manifest lists 
remain in Avro and are handled by the existing `ManifestReader` and 
`ManifestWriter` paths. The File Format API is about data files and delete 
files only.
+
+
+## Design
+
+The Rust API is three traits and a registry. `FormatModel` is the trait that 
each format implementation provides. `FormatReadBuilder` and 
`FormatWriteBuilder` are the per-operation configurators that `FormatModel` 
returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` 
instances. None of the traits carry generic parameters. The subsections below 
introduce each type, and a final "Design rationale" subsection explains the 
choices.
+
+### The FormatModel trait
+
+```rust
+pub trait FormatModel: Send + Sync + 'static {
+    fn format(&self) -> DataFileFormat;
+    fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>;
+    fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>;
+}
+```
+
+Each implementation registers one instance per `DataFileFormat` variant it 
supports. The `format` method returns that variant. `read_builder` and 
`write_builder` are the entry points for reading and writing a file. Both 
return trait objects so that the registry can hand them back from a 
`DataFileFormat`-keyed lookup.
+
+The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to 
`iceberg::spec::Schema`, and the physical schema type to 
`arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today 
that would fill the roles Java uses generic parameters for. Arguments for 
keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are 
in "Design rationale" below.
+
+### The read and write builders
+
+`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. 
`FormatModel` produces them, and the caller consumes them with `build`.
+
+```rust
+pub trait FormatReadBuilder: Send {
+    fn project(&mut self, schema: Schema) -> &mut Self;
+    fn filter(&mut self, predicate: BoundPredicate) -> &mut Self;
+    fn split(&mut self, start: u64, length: u64) -> &mut Self;
+    fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self;
+    fn batch_size(&mut self, batch_size: usize) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, 
Result<ArrowRecordBatchStream>>;
+}
+
+pub trait FormatWriteBuilder: Send {
+    fn schema(&mut self, schema: Schema) -> &mut Self;
+    fn set(&mut self, key: String, value: String) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn 
FormatFileWriter>>>;
+}
+```
+
+Both builders take Iceberg `Schema` values. Format implementations convert to 
physical schemas internally using `schema_to_arrow_schema` from 
`arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, 
which has a separate `engineProjection(S)` method for variant shredding and 
similar per-engine type mapping. Rust's builders expose one projection surface, 
and the hook for a variant-shredding "engine projection" is a future-extension 
point described in "Design rationale."
+
+`FormatWriteBuilder::build` produces a `FormatFileWriter`:
+
+```rust
+pub trait FormatFileWriter: Send {
+    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
+    fn close(self: Box<Self>) -> BoxFuture<'static, 
Result<Vec<DataFileBuilder>>>;
+}
+```
+
+The async methods return `BoxFuture` rather than using `async fn` in traits. 
The `self: Box<Self>` signature on `build` and `close` lets those methods 
consume the value while keeping the traits object-safe. Both patterns are 
forced by the trait-object boundary at the registry. The "BoxFuture instead of 
async fn in traits" and "Dynamic dispatch at the registry, static inside the 
format" subsections under "Design rationale" explain why.
+
+### The FormatRegistry
+
+`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances.
+
+```rust
+pub struct FormatRegistry {
+    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
+}
+
+impl FormatRegistry {
+    pub fn new() -> Self { ... }
+    pub fn register(&mut self, model: Box<dyn FormatModel>) { ... }
+    pub fn read_builder(
+        &self,
+        format: DataFileFormat,
+        input: InputFile,
+    ) -> Result<Box<dyn FormatReadBuilder>> { ... }
+    pub fn write_builder(
+        &self,
+        format: DataFileFormat,
+        output: OutputFile,
+    ) -> Result<Box<dyn FormatWriteBuilder>> { ... }
+}
+```
+
+Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to 
the enum. That is a non-breaking addition.
+
+The registry is an owned value, not a global static. Tests construct their 
own. Applications construct one at startup and pass it to scan planners and 
write orchestrators. For the common case of a single registry for the lifetime 
of a process, `default_format_registry()` returns a `&'static FormatRegistry` 
initialized through `OnceLock` on first call.
+
+`read_builder` and `write_builder` return `Err(Error { kind: 
ErrorKind::FeatureUnsupported, .. })` for unregistered formats. The error 
message distinguishes two cases: the format is implemented but its feature flag 
is disabled in this build, or the format has no implementation in this crate.
+
+### Feature flags
+
+Format implementations live in `iceberg::formats::{format}` and are gated 
behind a feature flag per format: `format-parquet` on by default, `format-orc` 
when an ORC implementation lands, and so on. The default feature set includes 
every format the crate implements. Users who build from source and want a 
smaller binary disable what they do not need.
+
+`FormatRegistry::default()` registers every format enabled by the current 
feature set at compile time:
+
+```rust
+impl Default for FormatRegistry {
+    fn default() -> Self {
+        let mut registry = Self::new();
+        #[cfg(feature = "format-parquet")]
+        registry.register(Box::new(ParquetFormatModel::new()));
+        #[cfg(feature = "format-orc")]
+        registry.register(Box::new(OrcFormatModel::new()));
+        registry
+    }
+}
+```
+
+Format implementations live inside the `iceberg` crate rather than as separate 
dependencies. This matches how the Java project keeps its `Parquet`, `Avro`, 
and `ORC` modules inside the `apache/iceberg` repository. In-tree 
implementations let the community control each format's quality and release 
cadence and keep the build graph narrow for downstream consumers. Nothing in 
the trait design prevents a downstream crate from defining its own 
`FormatModel` and registering it with a custom `FormatRegistry`, but in-tree is 
the recommended path for formats intended to ship in iceberg-rust itself.
+
+The `parquet` crate remains an unconditional dependency of `iceberg` for now, 
because non-format code (page index evaluators, row group metric evaluators, 
delete file loaders) still uses `parquet` types directly. A later pass gates 
the `parquet` dependency on the `format-parquet` feature once those callers 
move to format-agnostic abstractions.
+
+### Module layout
+
+```
+crates/iceberg/src/formats/
+├── mod.rs        # FormatModel, FormatReadBuilder, FormatWriteBuilder, 
FormatFileWriter
+├── registry.rs   # FormatRegistry, default_format_registry
+└── parquet.rs    # ParquetFormatModel, wrapping existing ParquetWriter and 
ArrowReader
+```
+
+Additional formats land as `formats/orc/`, `formats/avro_data/`, and so on, 
with directory-per-format for anything larger than a few hundred lines.
+
+In the initial implementation, `ParquetWriter`, `ParquetWriterBuilder`, 
`ArrowReader`, and `ArrowReaderBuilder` stay at their current module paths. 
`ParquetFormatModel` wraps them. Phase 3 of the Migration Plan moves them into 
`formats/parquet/` and retires the old paths.
+
+### Design rationale
+
+#### No generic over the data type
+
+Java's `FormatModel<D, S>` uses `D` for Spark `InternalRow`, Flink `RowData`, 
Arrow `ColumnarBatch`, and other engine-native row types. iceberg-rust has one 
row type: Arrow `RecordBatch`. Every writer accepts it, and every reader 
returns it. No format on the near-term queue (ORC, Avro data-file, Vortex, 
Lance) produces anything other than `RecordBatch` at the Iceberg-facing 
boundary. No engine integration in iceberg-rust today brings an engine-native 
row type the way Spark and Flink do in Java.
+
+Adding a `D` parameter today means writing `<RecordBatch>` everywhere the 
trait is used, for no present caller benefit. If a future format cannot bridge 
to Arrow, the trait can gain an associated type with a default, which is a 
semver-compatible addition.

Review Comment:
   > the trait can gain an associated type with a default
   
   I think we can do it in the initial cut, but I have not investigated the 
required effort for this



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.
+
+6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. 
It does not modify the Iceberg spec, the manifest format, the manifest list 
format, or the on-disk layout of any file.
+
+7. **Modify manifest read or write paths.** Manifests and manifest lists 
remain in Avro and are handled by the existing `ManifestReader` and 
`ManifestWriter` paths. The File Format API is about data files and delete 
files only.
+
+
+## Design
+
+The Rust API is three traits and a registry. `FormatModel` is the trait that 
each format implementation provides. `FormatReadBuilder` and 
`FormatWriteBuilder` are the per-operation configurators that `FormatModel` 
returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` 
instances. None of the traits carry generic parameters. The subsections below 
introduce each type, and a final "Design rationale" subsection explains the 
choices.
+
+### The FormatModel trait
+
+```rust
+pub trait FormatModel: Send + Sync + 'static {
+    fn format(&self) -> DataFileFormat;
+    fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>;
+    fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>;
+}
+```
+
+Each implementation registers one instance per `DataFileFormat` variant it 
supports. The `format` method returns that variant. `read_builder` and 
`write_builder` are the entry points for reading and writing a file. Both 
return trait objects so that the registry can hand them back from a 
`DataFileFormat`-keyed lookup.
+
+The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to 
`iceberg::spec::Schema`, and the physical schema type to 
`arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today 
that would fill the roles Java uses generic parameters for. Arguments for 
keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are 
in "Design rationale" below.
+
+### The read and write builders
+
+`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. 
`FormatModel` produces them, and the caller consumes them with `build`.
+
+```rust
+pub trait FormatReadBuilder: Send {
+    fn project(&mut self, schema: Schema) -> &mut Self;
+    fn filter(&mut self, predicate: BoundPredicate) -> &mut Self;
+    fn split(&mut self, start: u64, length: u64) -> &mut Self;
+    fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self;
+    fn batch_size(&mut self, batch_size: usize) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, 
Result<ArrowRecordBatchStream>>;
+}
+
+pub trait FormatWriteBuilder: Send {
+    fn schema(&mut self, schema: Schema) -> &mut Self;
+    fn set(&mut self, key: String, value: String) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn 
FormatFileWriter>>>;
+}
+```
+
+Both builders take Iceberg `Schema` values. Format implementations convert to 
physical schemas internally using `schema_to_arrow_schema` from 
`arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, 
which has a separate `engineProjection(S)` method for variant shredding and 
similar per-engine type mapping. Rust's builders expose one projection surface, 
and the hook for a variant-shredding "engine projection" is a future-extension 
point described in "Design rationale."
+
+`FormatWriteBuilder::build` produces a `FormatFileWriter`:
+
+```rust
+pub trait FormatFileWriter: Send {
+    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
+    fn close(self: Box<Self>) -> BoxFuture<'static, 
Result<Vec<DataFileBuilder>>>;
+}
+```
+
+The async methods return `BoxFuture` rather than using `async fn` in traits. 
The `self: Box<Self>` signature on `build` and `close` lets those methods 
consume the value while keeping the traits object-safe. Both patterns are 
forced by the trait-object boundary at the registry. The "BoxFuture instead of 
async fn in traits" and "Dynamic dispatch at the registry, static inside the 
format" subsections under "Design rationale" explain why.
+
+### The FormatRegistry
+
+`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances.
+
+```rust
+pub struct FormatRegistry {
+    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
+}
+
+impl FormatRegistry {
+    pub fn new() -> Self { ... }
+    pub fn register(&mut self, model: Box<dyn FormatModel>) { ... }
+    pub fn read_builder(
+        &self,
+        format: DataFileFormat,
+        input: InputFile,
+    ) -> Result<Box<dyn FormatReadBuilder>> { ... }
+    pub fn write_builder(
+        &self,
+        format: DataFileFormat,
+        output: OutputFile,
+    ) -> Result<Box<dyn FormatWriteBuilder>> { ... }
+}
+```
+
+Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to 
the enum. That is a non-breaking addition.
+
+The registry is an owned value, not a global static. Tests construct their 
own. Applications construct one at startup and pass it to scan planners and 
write orchestrators. For the common case of a single registry for the lifetime 
of a process, `default_format_registry()` returns a `&'static FormatRegistry` 
initialized through `OnceLock` on first call.

Review Comment:
   Would love to see more details on scan planner's API changes. Also what 
about writing?



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.
+
+6. **Change the Iceberg table spec.** This proposal is a Rust-only API change. 
It does not modify the Iceberg spec, the manifest format, the manifest list 
format, or the on-disk layout of any file.
+
+7. **Modify manifest read or write paths.** Manifests and manifest lists 
remain in Avro and are handled by the existing `ManifestReader` and 
`ManifestWriter` paths. The File Format API is about data files and delete 
files only.
+
+
+## Design
+
+The Rust API is three traits and a registry. `FormatModel` is the trait that 
each format implementation provides. `FormatReadBuilder` and 
`FormatWriteBuilder` are the per-operation configurators that `FormatModel` 
returns. `FormatRegistry` maps `DataFileFormat` values to `FormatModel` 
instances. None of the traits carry generic parameters. The subsections below 
introduce each type, and a final "Design rationale" subsection explains the 
choices.
+
+### The FormatModel trait
+
+```rust
+pub trait FormatModel: Send + Sync + 'static {
+    fn format(&self) -> DataFileFormat;
+    fn read_builder(&self, input: InputFile) -> Box<dyn FormatReadBuilder>;
+    fn write_builder(&self, output: OutputFile) -> Box<dyn FormatWriteBuilder>;
+}
+```
+
+Each implementation registers one instance per `DataFileFormat` variant it 
supports. The `format` method returns that variant. `read_builder` and 
`write_builder` are the entry points for reading and writing a file. Both 
return trait objects so that the registry can hand them back from a 
`DataFileFormat`-keyed lookup.
+
+The data type is fixed to Arrow `RecordBatch`, the Iceberg schema type to 
`iceberg::spec::Schema`, and the physical schema type to 
`arrow::datatypes::SchemaRef`. These are the only types in iceberg-rust today 
that would fill the roles Java uses generic parameters for. Arguments for 
keeping them fixed, including comparisons with Java's `FormatModel<D, S>`, are 
in "Design rationale" below.
+
+### The read and write builders
+
+`FormatReadBuilder` and `FormatWriteBuilder` configure a single read or write. 
`FormatModel` produces them, and the caller consumes them with `build`.
+
+```rust
+pub trait FormatReadBuilder: Send {
+    fn project(&mut self, schema: Schema) -> &mut Self;
+    fn filter(&mut self, predicate: BoundPredicate) -> &mut Self;
+    fn split(&mut self, start: u64, length: u64) -> &mut Self;
+    fn case_sensitive(&mut self, case_sensitive: bool) -> &mut Self;
+    fn batch_size(&mut self, batch_size: usize) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, 
Result<ArrowRecordBatchStream>>;
+}
+
+pub trait FormatWriteBuilder: Send {
+    fn schema(&mut self, schema: Schema) -> &mut Self;
+    fn set(&mut self, key: String, value: String) -> &mut Self;
+    fn build(self: Box<Self>) -> BoxFuture<'static, Result<Box<dyn 
FormatFileWriter>>>;
+}
+```
+
+Both builders take Iceberg `Schema` values. Format implementations convert to 
physical schemas internally using `schema_to_arrow_schema` from 
`arrow/schema.rs`. This deliberately departs from Java's `ReadBuilder<D, S>`, 
which has a separate `engineProjection(S)` method for variant shredding and 
similar per-engine type mapping. Rust's builders expose one projection surface, 
and the hook for a variant-shredding "engine projection" is a future-extension 
point described in "Design rationale."
+
+`FormatWriteBuilder::build` produces a `FormatFileWriter`:
+
+```rust
+pub trait FormatFileWriter: Send {
+    fn write(&mut self, batch: &RecordBatch) -> BoxFuture<'_, Result<()>>;
+    fn close(self: Box<Self>) -> BoxFuture<'static, 
Result<Vec<DataFileBuilder>>>;
+}
+```
+
+The async methods return `BoxFuture` rather than using `async fn` in traits. 
The `self: Box<Self>` signature on `build` and `close` lets those methods 
consume the value while keeping the traits object-safe. Both patterns are 
forced by the trait-object boundary at the registry. The "BoxFuture instead of 
async fn in traits" and "Dynamic dispatch at the registry, static inside the 
format" subsections under "Design rationale" explain why.
+
+### The FormatRegistry
+
+`FormatRegistry` maps `DataFileFormat` values to `FormatModel` instances.
+
+```rust
+pub struct FormatRegistry {
+    models: HashMap<DataFileFormat, Box<dyn FormatModel>>,
+}
+
+impl FormatRegistry {
+    pub fn new() -> Self { ... }
+    pub fn register(&mut self, model: Box<dyn FormatModel>) { ... }
+    pub fn read_builder(
+        &self,
+        format: DataFileFormat,
+        input: InputFile,
+    ) -> Result<Box<dyn FormatReadBuilder>> { ... }
+    pub fn write_builder(
+        &self,
+        format: DataFileFormat,
+        output: OutputFile,
+    ) -> Result<Box<dyn FormatWriteBuilder>> { ... }
+}
+```
+
+Using `DataFileFormat` as a `HashMap` key requires adding `#[derive(Hash)]` to 
the enum. That is a non-breaking addition.
+
+The registry is an owned value, not a global static. Tests construct their 
own. Applications construct one at startup and pass it to scan planners and 
write orchestrators. For the common case of a single registry for the lifetime 
of a process, `default_format_registry()` returns a `&'static FormatRegistry` 
initialized through `OnceLock` on first call.

Review Comment:
   Or we can inject format registry to Catalog instances directly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] rfc: Implement an API for all Data File Formats [iceberg-rust]

Reply via email to