fightBoxing opened a new pull request, #15580: URL: https://github.com/apache/iceberg/pull/15580
## Summary

Add a new `iceberg-lance` module to support the [Lance](https://github.com/lancedb/lance) columnar data format in Apache Iceberg. Lance is a modern columnar format optimized for ML/AI workloads with native vector search, O(1) random access, and zero-copy Arrow integration.

## Changes

### Modified Files

- **`api/.../FileFormat.java`**: Added `LANCE("lance", true)` enum value
- **`settings.gradle`**: Registered `lance` module
- **`build.gradle`**: Added `iceberg-lance` project configuration with Arrow dependencies

### New Module: `iceberg-lance` (12 core files + 8 test files)

#### Core Classes

| File | Description |
|------|-------------|
| `Lance.java` | Main entry class with `WriteBuilder`, `ReadBuilder`, `DataWriteBuilder` (follows Parquet/ORC pattern) |
| `LanceSchemaUtil.java` | Bidirectional Iceberg Schema ↔ Arrow Schema conversion |
| `LanceValueWriters.java` | Type-specific value writers (Boolean/Int/Long/Float/Double/String/Date/Time/Timestamp/Decimal/UUID/Binary) |
| `LanceValueReaders.java` | Type-specific value readers for all supported types |
| `LanceFileAppender.java` | `FileAppender` implementation with Metrics collection |
| `LanceIterable.java` | `CloseableIterable` implementation with column projection support |
| `LanceMetrics.java` | Metrics builder and `MetricsCollector` for rowCount/columnSizes/valueCounts/nullCounts/bounds |
| `LanceUtil.java` | Configuration constants and utility methods |

#### Data Layer Integration

| File | Description |
|------|-------------|
| `GenericLanceReader.java` | Generic `Record` reader adapter |
| `GenericLanceWriter.java` | Generic `Record` writer adapter |

#### Tests (60 test cases, all passing)

| Test Class | Cases | Coverage |
|-----------|-------|----------|
| `TestFileFormatLance` | 7 | Enum, splittable, extension, fromString |
| `TestLanceSchemaUtil` | 7 | Primitive/temporal/decimal/nested/map types, round-trip, null validation |
| `TestLanceValueReadersWriters` | 17 | Full type round-trip + forType factory |
| `TestLanceMetrics` | 5 | Simple/full metrics, collector, bounds, null bounds |
| `TestLanceDataWriter` | 8 | Write/builder/metrics/null/empty/schema/length/close-after-write |
| `TestLanceDataReader` | 6 | Read/builder/round-trip/null/empty/large dataset |
| `TestLanceReadProjection` | 4 | Column pruning, single/full/builder projection |
| `TestLanceUtil` | 6 | Extension, properties, fragment size, compression, constants |

## Architecture Design

### Why Lance in Iceberg?

| Dimension | Parquet/ORC | Lance | Value |
|-----------|-------------|-------|-------|
| Random Access | Full RowGroup/Stripe scan | O(1) row-level | 10-100x for AI inference |
| Vector Search | Not native | Built-in ANN index | No external vector DB needed |
| Update Efficiency | Copy-on-Write full rewrite | Native row-level update | Frequent update scenarios |
| Arrow Integration | Serialization required | Zero-copy mapping | Memory efficiency |

### Extension Architecture

```
iceberg-lance/
├── src/main/java/org/apache/iceberg/lance/ (10 core classes)
│   ├── Lance.java              - Entry: WriteBuilder / ReadBuilder / DataWriteBuilder
│   ├── LanceSchemaUtil.java    - Iceberg Schema ↔ Arrow Schema conversion
│   ├── LanceValueWriters.java  - Write Iceberg values to Arrow vectors
│   ├── LanceValueReaders.java  - Read Arrow vectors to Iceberg types
│   ├── LanceFileAppender.java  - FileAppender with Metrics
│   ├── LanceIterable.java      - CloseableIterable with projection
│   ├── LanceMetrics.java       - Metrics collection
│   └── LanceUtil.java          - Config constants and utilities
├── src/main/java/org/apache/iceberg/data/lance/ (2 adapters)
│   ├── GenericLanceReader.java
│   └── GenericLanceWriter.java
└── src/test/java/org/apache/iceberg/lance/ (8 test classes)
```

### Type Mapping (Iceberg ↔ Arrow ↔ Lance)

| Iceberg Type | Arrow Type | Lance Type |
|-------------|------------|------------|
| BooleanType | Bool | Bool |
| IntegerType | Int32 | Int32 |
| LongType | Int64 | Int64 |
| FloatType | Float32 | Float32 |
| DoubleType | Float64 | Float64 |
| DateType | Date32 | Date32 |
| TimeType | Time64(µs) | Time64(µs) |
| TimestampType | Timestamp(µs) | Timestamp(µs) |
| StringType | Utf8 | Utf8 |
| BinaryType | Binary | Binary |
| DecimalType(p,s) | Decimal128(p,s) | Decimal128(p,s) |
| UUIDType | FixedSizeBinary(16) | FixedSizeBinary(16) |
| ListType | List | List |
| MapType | Map | Map |
| StructType | Struct | Struct |

### Design Principles

1. **Follow existing patterns**: Fully mirrors `iceberg-parquet` and `iceberg-orc` module structure
2. **Optional dependency**: Registered via reflection in `InternalData`, no impact on existing functionality
3. **Arrow-native**: Uses Arrow as intermediate representation for zero-copy integration
4. **Complete Metrics**: Full support for rowCount, columnSizes, valueCounts, nullCounts, lowerBounds, upperBounds
5. **Column projection**: `LanceIterable` supports projection reads with fieldId matching

### Implementation Roadmap

- **Phase 1** ✅ Core format module (this PR)
- **Phase 2** 🔜 Data layer integration (BaseFileWriterFactory, GenericAppenderFactory)
- **Phase 3** 🔜 Engine integration (Spark/Flink readers and writers)
- **Phase 4** 🔜 Advanced features (ANN vector search, row-level Merge-on-Read)

### CI Checks

All checks pass locally:

- ✅ Spotless (Google Java Format)
- ✅ Checkstyle
- ✅ Error-Prone static analysis
- ✅ Javadoc compilation
- ✅ Apache License headers
- ✅ Unit tests (60/60 passed)
- ✅ Full build

> 📄 Full architecture design document: see `iceberg-lance-format-design.md` in this PR.
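As a rough illustration of the Iceberg ↔ Arrow type mapping table in this description, the lookup can be sketched as a plain string-to-string map. This is not the code in `LanceSchemaUtil.java`: the class name `LanceTypeMappingSketch` and the method `arrowTypeFor` are hypothetical, and type names are handled as plain strings so that the sketch needs no Iceberg or Arrow dependency.

```java
import java.util.Map;

// Hypothetical, dependency-free sketch of the type mapping table.
// It models type names as strings; it does not use Iceberg or Arrow classes.
public class LanceTypeMappingSketch {

    // Fixed mappings copied from the table (DecimalType is handled separately).
    private static final Map<String, String> ICEBERG_TO_ARROW = Map.ofEntries(
        Map.entry("BooleanType", "Bool"),
        Map.entry("IntegerType", "Int32"),
        Map.entry("LongType", "Int64"),
        Map.entry("FloatType", "Float32"),
        Map.entry("DoubleType", "Float64"),
        Map.entry("DateType", "Date32"),
        Map.entry("TimeType", "Time64(µs)"),
        Map.entry("TimestampType", "Timestamp(µs)"),
        Map.entry("StringType", "Utf8"),
        Map.entry("BinaryType", "Binary"),
        Map.entry("UUIDType", "FixedSizeBinary(16)"),
        Map.entry("ListType", "List"),
        Map.entry("MapType", "Map"),
        Map.entry("StructType", "Struct"));

    // Returns the Arrow type name for an Iceberg type name, or throws if unmapped.
    public static String arrowTypeFor(String icebergType) {
        String decimalPrefix = "DecimalType(";
        if (icebergType.startsWith(decimalPrefix) && icebergType.endsWith(")")) {
            // Precision and scale carry over unchanged: DecimalType(p,s) -> Decimal128(p,s).
            return "Decimal128(" + icebergType.substring(decimalPrefix.length());
        }
        String arrowType = ICEBERG_TO_ARROW.get(icebergType);
        if (arrowType == null) {
            throw new IllegalArgumentException("Unsupported Iceberg type: " + icebergType);
        }
        return arrowType;
    }

    public static void main(String[] args) {
        for (String type : new String[] {"IntegerType", "UUIDType", "DecimalType(38,10)"}) {
            System.out.println(type + " -> " + arrowTypeFor(type));
        }
    }
}
```

Failing loudly on an unmapped type, rather than falling back to a default, is a design choice of this sketch; the PR description does not say how the module handles unsupported types.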
