Re: [PR] rfc: Implement an API for all Data File Formats [iceberg-rust]

via GitHub Wed, 29 Apr 2026 03:00:15 -0700


xanderbailey commented on code in PR #2384:
URL: https://github.com/apache/iceberg-rust/pull/2384#discussion_r3160092754



##########
docs/rfcs/0003_file_format_api.md:
##########
@@ -0,0 +1,520 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+# RFC: File Format API for Apache Iceberg Rust
+
+## Background
+
+### Current state
+
+The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core 
library of the Apache Iceberg Rust project. Its module layout in 
`crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, 
`writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends 
directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` 
crates. It has no feature flags today.
+
+For data file writing, the crate provides a three-layer architecture described 
in `crates/iceberg/src/writer/mod.rs`:
+
+1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: 
traits for physical file writers, generic over an output type (defaulting to 
`Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and 
returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`.
+
+2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for 
logical Iceberg writers (data files, equality deletes, position deletes, 
partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output.
+
+3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in 
`writer/file_writer/parquet_writer.rs` are the only `FileWriter` 
implementation. Higher-level writers such as `DataFileWriterBuilder` 
(`writer/base_writer/data_file_writer.rs`) and 
`EqualityDeleteFileWriterBuilder` 
(`writer/base_writer/equality_delete_writer.rs`) are generic over any 
`FileWriterBuilder`, but every example, test, and integration instantiates them 
with `ParquetWriterBuilder`.
+
+The trait layer is format-agnostic. Every concrete instantiation uses Parquet.
+
+For data file reading, the crate provides `ArrowReaderBuilder` and 
`ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the 
generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` 
directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses 
Parquet-specific row group filtering and page index logic. 
`TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the 
only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` 
field, but `process_file_scan_task` ignores it. `DataFileFormat` appears 
throughout `src/`. Almost every non-test reference is 
`DataFileFormat::Parquet`. The only non-Parquet references are two uses of 
`DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files.
+
+The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: 
`Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and 
deletion vectors rather than row data. They are handled in 
`crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). 
Data file support today:
+
+| Format  | Data file read | Data file write | Manifests |
+|---------|----------------|-----------------|-----------|
+| Parquet | Yes            | Yes             | No        |
+| Avro    | No             | No              | Yes       |
+| ORC     | No             | No              | No        |
+
+A table containing ORC or Avro data files cannot be read from Rust today, even 
though both are valid per the Iceberg spec.
+
+Arrow `RecordBatch` is the only in-memory data representation in the crate. It 
is the input type for every writer trait and the output type for every reader. 
The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, 
`arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and 
constant column injection, and the Parquet-to-Arrow read path. Every 
integration crate, including `iceberg-datafusion`, consumes Arrow. The crate 
does not define a generic `Record` type and does not integrate with 
engine-specific row types such as Java's `InternalRow` or `RowData`.
+
+### Pain points
+
+1. **No extension point for new formats.** Adding ORC means editing 
`ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic 
through every layer that touches it. The write path has the same shape.
+
+2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` 
exposes Parquet-specific options (`with_metadata_size_hint`, 
`with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are 
meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory 
representation (Arrow) with the on-disk format (Parquet).
+
+3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, 
value counts, and min/max bounds through `MinMaxColAggregator` and 
`NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. 
Another format cannot produce comparable manifest metadata without a shared 
statistics interface.
+
+4. **V3 types will need per-format serialization.** The V3 spec adds Variant 
and Geometry types. Each format encodes them differently: Parquet uses variant 
shredding, ORC uses binary, Avro uses union types. Implementing either type 
without a format abstraction means new `match` arms in every reader and writer 
code path.
+
+### Prior work
+
+The Java project shipped `FormatModel<D, S>` in February 2026 (PR 
[#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month 
review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs 
[#15253](https://github.com/apache/iceberg/pull/15253) through 
[#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for 
Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink 
([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after.
+
+Java's `FormatModel` carries two generic parameters: a data type `D` (Java 
uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) 
and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and 
others). A static `FormatModelRegistry` stores implementations keyed by 
`Pair<FileFormat, Class<?>>` and populates itself through Java reflection at 
startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and 
`ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations 
wrap them rather than replace them.
+
+PyIceberg has an open proposal for the equivalent capability in issue 
[apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100),
 with an in-progress PR at 
[apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119).
 Because PyIceberg uses PyArrow as its only in-memory representation, the 
proposal drops Java's generic type parameters and keys the registry on file 
format alone. Two prior PyIceberg ORC PRs 
([#790](https://github.com/apache/iceberg-python/pull/790), 
[#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as 
stale without merging, which reinforces the case for landing an abstraction 
layer before adding new formats.
+
+Design decisions in this RFC that differ from Java (no generics, 
single-dimension registry key, hard cutover from the old Parquet types) are 
justified inline where they appear. Specific Java design points that shaped 
those decisions are called out in Alternatives Considered.
+
+## Goals
+
+The user-facing outcome of this proposal is that every Iceberg data file and 
delete file flows through the same stable and extensible API. Parquet is the 
first format to land. ORC, Avro, and others follow on the same interface. Every 
goal below serves that outcome.
+
+1. Define a `FormatModel` trait that encapsulates format-specific read and 
write behavior independent of the on-disk format.
+
+2. Remove hard-coded Parquet assumptions from scan and write orchestration. 
After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch 
through the format abstraction instead of constructing Parquet types directly.
+
+3. Provide a registry that maps `DataFileFormat` values to `FormatModel` 
implementations, so callers obtain readers and writers without naming the 
concrete format type.
+
+4. Define a conformance test suite (TCK) that any `FormatModel` implementation 
must pass before it merges.
+
+5. Match the Java and PyIceberg designs where they align, and diverge where 
Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences 
are called out inline.
+
+## Non-Goals
+
+The items below are deliberately out of scope to keep this proposal focused on 
the abstraction and its Parquet implementation. Most are follow-up work that 
the API enables but does not itself deliver.
+
+1. **Ship new format implementations.** This RFC lands the abstraction and a 
Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up 
RFCs.
+
+2. **Introduce a plugin protocol or runtime library loading.** Rust does not 
offer a clean mechanism for loading compiled plugins at runtime. A 
runtime-linking approach using `libloading` or similar would expand scope 
beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, 
Lance) require.
+
+3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and 
deletion vectors rather than row data. They have a different lifecycle from 
data files and are already handled separately in `crates/iceberg/src/puffin/`.
+
+4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and 
`FileWriter` layering is sound. This RFC adds a format abstraction beneath 
`FileWriter`, not a replacement for it.
+
+5. **Implement variant shredding or encryption.** Java exposes 
`engineProjection` and `engineSchema` as extension points for variant shredding 
and similar format-specific type mapping, and `withFileEncryptionKey` and 
`withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future 
extensions in the Rust design. Implementing either requires a dedicated RFC.

Review Comment:
   Encryption work is currently in-flight so I'd love to understand how we 
could incorporate that in this plan. 
   
   ```
     let mut wb = registry.write_builder(format, output)?;
     if let Some(em) = &self.encryption_manager {
         let (dek, aad) = em.create_file_key().await?;
         wb.with_file_encryption_key(&dek);
         wb.with_aad_prefix(&aad);
     }
     let writer = wb.build().await?;
   ```
   
   I need to think about whether something like this might make sense.
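   
   To make the shape of that idea concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical: `WriteBuilder`, `Writer`, and `open_writer` are stand-ins invented for illustration, not existing iceberg-rust APIs. It is synchronous to stay runnable without an async runtime, and the rule that an AAD prefix requires a key is an assumption, not something the RFC states.
   
   ```rust
   // Hypothetical sketch: these types are stand-ins, not iceberg-rust APIs.
   
   /// Stand-in for a format-agnostic write builder handed out by a registry.
   #[derive(Debug, Default)]
   pub struct WriteBuilder {
       path: String,
       file_encryption_key: Option<Vec<u8>>,
       aad_prefix: Option<Vec<u8>>,
   }
   
   /// Stand-in for the writer produced by `WriteBuilder::build`.
   #[derive(Debug)]
   pub struct Writer {
       pub path: String,
       pub encrypted: bool,
   }
   
   impl WriteBuilder {
       pub fn new(path: &str) -> Self {
           Self { path: path.to_string(), ..Default::default() }
       }
   
       /// Optional hook: formats without encryption support could reject or ignore it.
       pub fn with_file_encryption_key(&mut self, key: &[u8]) -> &mut Self {
           self.file_encryption_key = Some(key.to_vec());
           self
       }
   
       /// Optional hook mirroring the AAD prefix setter from the snippet above.
       pub fn with_aad_prefix(&mut self, aad: &[u8]) -> &mut Self {
           self.aad_prefix = Some(aad.to_vec());
           self
       }
   
       /// Assumed validation: an AAD prefix is only meaningful together with a key.
       pub fn build(self) -> Result<Writer, String> {
           if self.aad_prefix.is_some() && self.file_encryption_key.is_none() {
               return Err("AAD prefix set without a file encryption key".to_string());
           }
           Ok(Writer {
               path: self.path,
               encrypted: self.file_encryption_key.is_some(),
           })
       }
   }
   
   /// Orchestration code stays format-agnostic: it only forwards key material
   /// (data encryption key, AAD prefix) when the caller supplies it.
   pub fn open_writer(
       path: &str,
       key_material: Option<(Vec<u8>, Vec<u8>)>,
   ) -> Result<Writer, String> {
       let mut wb = WriteBuilder::new(path);
       if let Some((dek, aad)) = &key_material {
           wb.with_file_encryption_key(dek);
           wb.with_aad_prefix(aad);
       }
       wb.build()
   }
   ```
   
   With this shape, `open_writer("a.parquet", None)` yields an unencrypted writer, and passing `Some((dek, aad))` yields an encrypted one without the caller naming a concrete format type. Whether the hooks belong on the shared builder or on a Parquet-specific extension is the open question.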



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

