xanderbailey commented on code in PR #2384: URL: https://github.com/apache/iceberg-rust/pull/2384#discussion_r3160092754
########## docs/rfcs/0003_file_format_api.md: ########## @@ -0,0 +1,520 @@ +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. +--> + +# RFC: File Format API for Apache Iceberg Rust + +## Background + +### Current state + +The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core library of the Apache Iceberg Rust project. Its module layout in `crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, `writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today. + +For data file writing, the crate provides a three-layer architecture described in `crates/iceberg/src/writer/mod.rs`: + +1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: traits for physical file writers, generic over an output type (defaulting to `Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`. + +2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for logical Iceberg writers (data files, equality deletes, position deletes, partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output. + +3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in `writer/file_writer/parquet_writer.rs` are the only `FileWriter` implementation. Higher-level writers such as `DataFileWriterBuilder` (`writer/base_writer/data_file_writer.rs`) and `EqualityDeleteFileWriterBuilder` (`writer/base_writer/equality_delete_writer.rs`) are generic over any `FileWriterBuilder`, but every example, test, and integration instantiates them with `ParquetWriterBuilder`. + +The trait layer is format-agnostic. Every concrete instantiation uses Parquet. + +For data file reading, the crate provides `ArrowReaderBuilder` and `ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses Parquet-specific row group filtering and page index logic. `TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` field, but `process_file_scan_task` ignores it. `DataFileFormat` appears throughout `src/`. Almost every non-test reference is `DataFileFormat::Parquet`. The only non-Parquet references are two uses of `DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files. + +The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: `Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and deletion vectors rather than row data. They are handled in `crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). Data file support today: + +| Format | Data file read | Data file write | Manifests | +|---------|----------------|-----------------|-----------| +| Parquet | Yes | Yes | No | +| Avro | No | No | Yes | +| ORC | No | No | No | + +A table containing ORC or Avro data files cannot be read from Rust today, even though both are valid per the Iceberg spec. + +Arrow `RecordBatch` is the only in-memory data representation in the crate. It is the input type for every writer trait and the output type for every reader. The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, `arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and constant column injection, and the Parquet-to-Arrow read path. Every integration crate, including `iceberg-datafusion`, consumes Arrow. The crate does not define a generic `Record` type and does not integrate with engine-specific row types such as Java's `InternalRow` or `RowData`. + +### Pain points + +1. **No extension point for new formats.** Adding ORC means editing `ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic through every layer that touches it. The write path has the same shape. + +2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options (`with_metadata_size_hint`, `with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory representation (Arrow) with the on-disk format (Parquet). + +3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, value counts, and min/max bounds through `MinMaxColAggregator` and `NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. Another format cannot produce comparable manifest metadata without a shared statistics interface. + +4. **V3 types will need per-format serialization.** The V3 spec adds Variant and Geometry types. Each format encodes them differently: Parquet uses variant shredding, ORC uses binary, Avro uses union types. Implementing either type without a format abstraction means new `match` arms in every reader and writer code path. + +### Prior work + +The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs [#15253](https://github.com/apache/iceberg/pull/15253) through [#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink ([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after. + +Java's `FormatModel` carries two generic parameters: a data type `D` (Java uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and others). A static `FormatModelRegistry` stores implementations keyed by `Pair<FileFormat, Class<?>>` and populates itself through Java reflection at startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and `ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations wrap them rather than replace them. + +PyIceberg has an open proposal for the equivalent capability in issue [apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100), with an in-progress PR at [apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119). Because PyIceberg uses PyArrow as its only in-memory representation, the proposal drops Java's generic type parameters and keys the registry on file format alone. Two prior PyIceberg ORC PRs ([#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as stale without merging, which reinforces the case for landing an abstraction layer before adding new formats. + +Design decisions in this RFC that differ from Java (no generics, single-dimension registry key, hard cutover from the old Parquet types) are justified inline where they appear. Specific Java design points that shaped those decisions are called out in Alternatives Considered. + +## Goals + +The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome. + +1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly. + +3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type. + +4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges. + +5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline. + +## Non-Goals + +The items below are deliberately out of scope to keep this proposal focused on the abstraction and its Parquet implementation. Most are follow-up work that the API enables but does not itself deliver. + +1. **Ship new format implementations.** This RFC lands the abstraction and a Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up RFCs. + +2. **Introduce a plugin protocol or runtime library loading.** Rust does not offer a clean mechanism for loading compiled plugins at runtime. A runtime-linking approach using `libloading` or similar would expand scope beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, Lance) require. + +3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`. + +4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it. + +5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC. Review Comment: Encryption work is currently in-flight so I'd love to understand how we could incorporate that in this plan. ``` let mut wb = registry.write_builder(format, output)?; if let Some(em) = &self.encryption_manager { let (dek, aad) = em.create_file_key().await?; wb.with_file_encryption_key(&dek); wb.with_aad_prefix(&aad); } let writer = wb.build().await?; `` I need to think if something like this might make sense? ########## docs/rfcs/0003_file_format_api.md: ########## @@ -0,0 +1,520 @@ +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. +--> + +# RFC: File Format API for Apache Iceberg Rust + +## Background + +### Current state + +The `iceberg` crate (version 0.9.0, Rust 1.92 as of this writing) is the core library of the Apache Iceberg Rust project. Its module layout in `crates/iceberg/src/lib.rs` exposes public modules for `spec`, `arrow`, `writer`, `scan`, `io`, `expr`, `transaction`, and others. The crate depends directly on the `parquet` crate (with the `async` feature) and on the `arrow-*` crates. It has no feature flags today. + +For data file writing, the crate provides a three-layer architecture described in `crates/iceberg/src/writer/mod.rs`: + +1. **`FileWriter` / `FileWriterBuilder`** in `writer/file_writer/mod.rs`: traits for physical file writers, generic over an output type (defaulting to `Vec<DataFileBuilder>`). `FileWriterBuilder::build` takes an `OutputFile` and returns a `FileWriter`. `FileWriter::write` takes a `&RecordBatch`. + +2. **`IcebergWriter` / `IcebergWriterBuilder`** in `writer/mod.rs`: traits for logical Iceberg writers (data files, equality deletes, position deletes, partitioning), defaulting to `RecordBatch` input and `Vec<DataFile>` output. + +3. **Concrete implementations**: `ParquetWriterBuilder` and `ParquetWriter` in `writer/file_writer/parquet_writer.rs` are the only `FileWriter` implementation. Higher-level writers such as `DataFileWriterBuilder` (`writer/base_writer/data_file_writer.rs`) and `EqualityDeleteFileWriterBuilder` (`writer/base_writer/equality_delete_writer.rs`) are generic over any `FileWriterBuilder`, but every example, test, and integration instantiates them with `ParquetWriterBuilder`. + +The trait layer is format-agnostic. Every concrete instantiation uses Parquet. + +For data file reading, the crate provides `ArrowReaderBuilder` and `ArrowReader` in `arrow/reader.rs`. Both are Parquet-specific despite the generic name. `ArrowReader::process_file_scan_task` calls `open_parquet_file` directly, constructs a `ParquetRecordBatchReaderBuilder`, and uses Parquet-specific row group filtering and page index logic. `TableScan::to_arrow` in `scan/mod.rs` wires `ArrowReaderBuilder` in as the only reader path. `FileScanTask` carries a `data_file_format: DataFileFormat` field, but `process_file_scan_task` ignores it. `DataFileFormat` appears throughout `src/`. Almost every non-test reference is `DataFileFormat::Parquet`. The only non-Parquet references are two uses of `DataFileFormat::Avro` in `transaction/snapshot.rs` for manifest files. + +The `DataFileFormat` enum in `spec/manifest/data_file.rs` has four variants: `Avro`, `Orc`, `Parquet`, and `Puffin`. Puffin files hold statistics and deletion vectors rather than row data. They are handled in `crates/iceberg/src/puffin/` and are out of scope for this RFC (see Non-Goals). Data file support today: + +| Format | Data file read | Data file write | Manifests | +|---------|----------------|-----------------|-----------| +| Parquet | Yes | Yes | No | +| Avro | No | No | Yes | +| ORC | No | No | No | + +A table containing ORC or Avro data files cannot be read from Rust today, even though both are valid per the Iceberg spec. + +Arrow `RecordBatch` is the only in-memory data representation in the crate. It is the input type for every writer trait and the output type for every reader. The `arrow/` module provides schema conversion (`schema_to_arrow_schema`, `arrow_schema_to_schema`), `RecordBatchTransformer` for schema evolution and constant column injection, and the Parquet-to-Arrow read path. Every integration crate, including `iceberg-datafusion`, consumes Arrow. The crate does not define a generic `Record` type and does not integrate with engine-specific row types such as Java's `InternalRow` or `RowData`. + +### Pain points + +1. **No extension point for new formats.** Adding ORC means editing `ArrowReader` to branch on `DataFileFormat` and threading ORC-specific logic through every layer that touches it. The write path has the same shape. + +2. **Parquet assumptions leak into generic code.** `ArrowReaderBuilder` exposes Parquet-specific options (`with_metadata_size_hint`, `with_row_group_filtering_enabled`, `with_row_selection_enabled`) that are meaningless for ORC or Avro. The name "ArrowReader" conflates the in-memory representation (Arrow) with the on-disk format (Parquet). + +3. **No format-agnostic statistics.** `ParquetWriter` computes column sizes, value counts, and min/max bounds through `MinMaxColAggregator` and `NanValueCountVisitor`, both tightly coupled to Parquet's `Statistics` type. Another format cannot produce comparable manifest metadata without a shared statistics interface. + +4. **V3 types will need per-format serialization.** The V3 spec adds Variant and Geometry types. Each format encodes them differently: Parquet uses variant shredding, ORC uses binary, Avro uses union types. Implementing either type without a format abstraction means new `match` arms in every reader and writer code path. + +### Prior work + +The Java project shipped `FormatModel<D, S>` in February 2026 (PR [#12774](https://github.com/apache/iceberg/pull/12774)) after a 10-month review. Implementations for Parquet, Avro, ORC, and Arrow followed in PRs [#15253](https://github.com/apache/iceberg/pull/15253) through [#15258](https://github.com/apache/iceberg/pull/15258). Engine migrations for Spark ([#15328](https://github.com/apache/iceberg/pull/15328)) and Flink ([#15329](https://github.com/apache/iceberg/pull/15329)) landed shortly after. + +Java's `FormatModel` carries two generic parameters: a data type `D` (Java uses `Record`, Spark `InternalRow`, Flink `RowData`, and Arrow `ColumnarBatch`) and an engine schema type `S` (Spark `StructType`, Flink `RowType`, and others). A static `FormatModelRegistry` stores implementations keyed by `Pair<FileFormat, Class<?>>` and populates itself through Java reflection at startup. The old `Parquet.WriteBuilder`, `Avro.WriteBuilder`, and `ORC.WriteBuilder` are not deprecated. The new `FormatModel` implementations wrap them rather than replace them. + +PyIceberg has an open proposal for the equivalent capability in issue [apache/iceberg-python#3100](https://github.com/apache/iceberg-python/issues/3100), with an in-progress PR at [apache/iceberg-python#3119](https://github.com/apache/iceberg-python/pull/3119). Because PyIceberg uses PyArrow as its only in-memory representation, the proposal drops Java's generic type parameters and keys the registry on file format alone. Two prior PyIceberg ORC PRs ([#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)) were closed as stale without merging, which reinforces the case for landing an abstraction layer before adding new formats. + +Design decisions in this RFC that differ from Java (no generics, single-dimension registry key, hard cutover from the old Parquet types) are justified inline where they appear. Specific Java design points that shaped those decisions are called out in Alternatives Considered. + +## Goals + +The user-facing outcome of this proposal is that every Iceberg data file and delete file flows through the same stable and extensible API. Parquet is the first format to land. ORC, Avro, and others follow on the same interface. Every goal below serves that outcome. + +1. Define a `FormatModel` trait that encapsulates format-specific read and write behavior independent of the on-disk format. + +2. Remove hard-coded Parquet assumptions from scan and write orchestration. After this work, `TableScan::to_arrow` and `DataFileWriterBuilder` dispatch through the format abstraction instead of constructing Parquet types directly. + +3. Provide a registry that maps `DataFileFormat` values to `FormatModel` implementations, so callers obtain readers and writers without naming the concrete format type. + +4. Define a conformance test suite (TCK) that any `FormatModel` implementation must pass before it merges. + +5. Match the Java and PyIceberg designs where they align, and diverge where Rust's single-data-type ecosystem and pre-1.0 status justify it. Divergences are called out inline. + +## Non-Goals + +The items below are deliberately out of scope to keep this proposal focused on the abstraction and its Parquet implementation. Most are follow-up work that the API enables but does not itself deliver. + +1. **Ship new format implementations.** This RFC lands the abstraction and a Parquet implementation. ORC, Avro data-file, Vortex, and Lance are follow-up RFCs. + +2. **Introduce a plugin protocol or runtime library loading.** Rust does not offer a clean mechanism for loading compiled plugins at runtime. A runtime-linking approach using `libloading` or similar would expand scope beyond what the formats currently under discussion (Parquet, ORC, Avro, Vortex, Lance) require. + +3. **Add Puffin support to the FormatModel.** Puffin files hold statistics and deletion vectors rather than row data. They have a different lifecycle from data files and are already handled separately in `crates/iceberg/src/puffin/`. + +4. **Redesign the writer trait hierarchy.** The existing `IcebergWriter` and `FileWriter` layering is sound. This RFC adds a format abstraction beneath `FileWriter`, not a replacement for it. + +5. **Implement variant shredding or encryption.** Java exposes `engineProjection` and `engineSchema` as extension points for variant shredding and similar format-specific type mapping, and `withFileEncryptionKey` and `withAADPrefix` for Parquet encryption. Equivalent hooks are noted as future extensions in the Rust design. Implementing either requires a dedicated RFC. Review Comment: Encryption work is currently in-flight so I'd love to understand how we could incorporate that in this plan. ``` let mut wb = registry.write_builder(format, output)?; if let Some(em) = &self.encryption_manager { let (dek, aad) = em.create_file_key().await?; wb.with_file_encryption_key(&dek); wb.with_aad_prefix(&aad); } let writer = wb.build().await?; ``` I need to think if something like this might make sense? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
