fightBoxing opened a new pull request, #15580:
URL: https://github.com/apache/iceberg/pull/15580

   ## Summary
   
   Add a new `iceberg-lance` module to support the
[Lance](https://github.com/lancedb/lance) columnar data format in Apache
Iceberg. Lance is a modern columnar format optimized for ML/AI workloads, with
native vector search, O(1) random access, and zero-copy Arrow integration.
   
   ## Changes
   
   ### Modified Files
   - **`api/.../FileFormat.java`** — Added `LANCE("lance", true)` enum value
   - **`settings.gradle`** — Registered `lance` module
   - **`build.gradle`** — Added `iceberg-lance` project configuration with 
Arrow dependencies
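
The `FileFormat` change amounts to one more enum constant carrying a file extension and a splittable flag. The sketch below is a simplified stand-in rather than the real `org.apache.iceberg.FileFormat`; the `FileFormatSketch` name and the exact helper bodies are assumptions made for illustration.

```java
import java.util.Locale;

// Simplified stand-in for Iceberg's FileFormat enum (hypothetical name).
enum FileFormatSketch {
  ORC("orc", true),
  PARQUET("parquet", true),
  AVRO("avro", true),
  LANCE("lance", true); // the constant this PR adds

  private final String ext;
  private final boolean splittable;

  FileFormatSketch(String ext, boolean splittable) {
    this.ext = "." + ext;
    this.splittable = splittable;
  }

  boolean isSplittable() {
    return splittable;
  }

  // Appends the format's extension unless the name already ends with it.
  String addExtension(String filename) {
    return filename.endsWith(ext) ? filename : filename + ext;
  }

  // Case-insensitive lookup by format name.
  static FileFormatSketch fromString(String name) {
    return valueOf(name.toUpperCase(Locale.ENGLISH));
  }

  public static void main(String[] args) {
    System.out.println(fromString("lance").addExtension("part-00000"));
  }
}
```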
   
   ### New Module: `iceberg-lance` (12 core files + 8 test files)
   
   #### Core Classes
   | File | Description |
   |------|-------------|
   | `Lance.java` | Main entry class with `WriteBuilder`, `ReadBuilder`, `DataWriteBuilder` (follows Parquet/ORC pattern) |
   | `LanceSchemaUtil.java` | Bidirectional Iceberg Schema ↔ Arrow Schema conversion |
   | `LanceValueWriters.java` | Type-specific value writers (Boolean/Int/Long/Float/Double/String/Date/Time/Timestamp/Decimal/UUID/Binary) |
   | `LanceValueReaders.java` | Type-specific value readers for all supported types |
   | `LanceFileAppender.java` | `FileAppender` implementation with Metrics collection |
   | `LanceIterable.java` | `CloseableIterable` implementation with column projection support |
   | `LanceMetrics.java` | Metrics builder and `MetricsCollector` for rowCount/columnSizes/valueCounts/nullCounts/bounds |
   | `LanceUtil.java` | Configuration constants and utility methods |
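
To make the metrics row above more concrete, here is a minimal sketch of a per-column collector that tracks row count, value counts, null counts, and bounds keyed by field ID. It is not the PR's `LanceMetrics` class: `ColumnStatsSketch` and its methods are hypothetical, and it handles only long-valued columns to stay short. Null values count toward the value count, as Iceberg's metrics do.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-column statistics collector, keyed by Iceberg field ID.
final class ColumnStatsSketch {
  private long rowCount = 0;
  private final Map<Integer, Long> valueCounts = new HashMap<>();
  private final Map<Integer, Long> nullCounts = new HashMap<>();
  private final Map<Integer, Long> lowerBounds = new HashMap<>();
  private final Map<Integer, Long> upperBounds = new HashMap<>();

  // Call once per appended row.
  void rowAdded() {
    rowCount++;
  }

  // Call once per field value; nulls count as values but do not affect bounds.
  void add(int fieldId, Long value) {
    valueCounts.merge(fieldId, 1L, Long::sum);
    if (value == null) {
      nullCounts.merge(fieldId, 1L, Long::sum);
      return;
    }
    lowerBounds.merge(fieldId, value, Long::min);
    upperBounds.merge(fieldId, value, Long::max);
  }

  long rowCount() { return rowCount; }
  long valueCount(int fieldId) { return valueCounts.getOrDefault(fieldId, 0L); }
  long nullCount(int fieldId) { return nullCounts.getOrDefault(fieldId, 0L); }

  // Bounds are null when the column has no non-null values.
  Long lowerBound(int fieldId) { return lowerBounds.get(fieldId); }
  Long upperBound(int fieldId) { return upperBounds.get(fieldId); }

  public static void main(String[] args) {
    ColumnStatsSketch stats = new ColumnStatsSketch();
    for (Long v : new Long[] {5L, null, 2L}) {
      stats.rowAdded();
      stats.add(1, v);
    }
    System.out.println(stats.lowerBound(1) + ".." + stats.upperBound(1)); // prints 2..5
  }
}
```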
   
   #### Data Layer Integration
   | File | Description |
   |------|-------------|
   | `GenericLanceReader.java` | Generic `Record` reader adapter |
   | `GenericLanceWriter.java` | Generic `Record` writer adapter |
   
   #### Tests (60 test cases, all passing)
   | Test Class | Cases | Coverage |
   |-----------|-------|----------|
   | `TestFileFormatLance` | 7 | Enum, splittable, extension, fromString |
   | `TestLanceSchemaUtil` | 7 | Primitive/temporal/decimal/nested/map types, round-trip, null validation |
   | `TestLanceValueReadersWriters` | 17 | Full type round-trip + forType factory |
   | `TestLanceMetrics` | 5 | Simple/full metrics, collector, bounds, null bounds |
   | `TestLanceDataWriter` | 8 | Write/builder/metrics/null/empty/schema/length/close-after-write |
   | `TestLanceDataReader` | 6 | Read/builder/round-trip/null/empty/large dataset |
   | `TestLanceReadProjection` | 4 | Column pruning, single/full/builder projection |
   | `TestLanceUtil` | 6 | Extension, properties, fragment size, compression, constants |
   
   ## Architecture Design
   
   ### Why Lance in Iceberg?
   
   | Dimension | Parquet/ORC | Lance | Value |
   |-----------|-------------|-------|-------|
   | Random Access | Full RowGroup/Stripe scan | O(1) row-level | 10-100x for AI inference |
   | Vector Search | Not native | Built-in ANN index | No external vector DB needed |
   | Update Efficiency | Copy-on-Write full rewrite | Native row-level update | Frequent update scenarios |
   | Arrow Integration | Serialization required | Zero-copy mapping | Memory efficiency |
   
   ### Extension Architecture
   
   ```
   iceberg-lance/
   ├── src/main/java/org/apache/iceberg/lance/    (8 core classes)
   │   ├── Lance.java              — Entry: WriteBuilder / ReadBuilder / DataWriteBuilder
   │   ├── LanceSchemaUtil.java    — Iceberg Schema ↔ Arrow Schema conversion
   │   ├── LanceValueWriters.java  — Write Iceberg values to Arrow vectors
   │   ├── LanceValueReaders.java  — Read Arrow vectors to Iceberg types
   │   ├── LanceFileAppender.java  — FileAppender with Metrics
   │   ├── LanceIterable.java      — CloseableIterable with projection
   │   ├── LanceMetrics.java       — Metrics collection
   │   └── LanceUtil.java          — Config constants and utilities
   ├── src/main/java/org/apache/iceberg/data/lance/  (2 adapters)
   │   ├── GenericLanceReader.java
   │   └── GenericLanceWriter.java
   └── src/test/java/org/apache/iceberg/lance/       (8 test classes)
   ```
   
   ### Type Mapping (Iceberg ↔ Arrow ↔ Lance)
   
   | Iceberg Type | Arrow Type | Lance Type |
   |-------------|------------|------------|
   | BooleanType | Bool | Bool |
   | IntegerType | Int32 | Int32 |
   | LongType | Int64 | Int64 |
   | FloatType | Float32 | Float32 |
   | DoubleType | Float64 | Float64 |
   | DateType | Date32 | Date32 |
   | TimeType | Time64(µs) | Time64(µs) |
   | TimestampType | Timestamp(µs) | Timestamp(µs) |
   | StringType | Utf8 | Utf8 |
   | BinaryType | Binary | Binary |
   | DecimalType(p,s) | Decimal128(p,s) | Decimal128(p,s) |
   | UUIDType | FixedSizeBinary(16) | FixedSizeBinary(16) |
   | ListType | List | List |
   | MapType | Map | Map |
   | StructType | Struct | Struct |
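
The primitive rows of the table above can be read as a single lookup from an Iceberg type to an Arrow type. The sketch below expresses that lookup with plain strings so it runs without Iceberg or Arrow on the classpath; `TypeMappingSketch`, `IcebergKind`, and `arrowTypeFor` are hypothetical names, and the PR's `LanceSchemaUtil` presumably works with the real schema classes instead.

```java
// Hypothetical illustration of the Iceberg -> Arrow primitive type mapping.
final class TypeMappingSketch {
  enum IcebergKind {
    BOOLEAN, INTEGER, LONG, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP,
    STRING, BINARY, DECIMAL, UUID
  }

  // precision and scale are only used for DECIMAL and ignored otherwise.
  static String arrowTypeFor(IcebergKind kind, int precision, int scale) {
    switch (kind) {
      case BOOLEAN: return "Bool";
      case INTEGER: return "Int32";
      case LONG: return "Int64";
      case FLOAT: return "Float32";
      case DOUBLE: return "Float64";
      case DATE: return "Date32";
      case TIME: return "Time64(µs)";
      case TIMESTAMP: return "Timestamp(µs)";
      case STRING: return "Utf8";
      case BINARY: return "Binary";
      case DECIMAL: return "Decimal128(" + precision + "," + scale + ")";
      case UUID: return "FixedSizeBinary(16)";
      default: throw new IllegalArgumentException("Unsupported type: " + kind);
    }
  }

  public static void main(String[] args) {
    System.out.println(arrowTypeFor(IcebergKind.DECIMAL, 10, 2)); // prints Decimal128(10,2)
  }
}
```

Nested types (list, map, struct) would recurse over their child fields in the same way and are left out here.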
   
   ### Design Principles
   
   1. **Follow existing patterns** — Fully mirrors `iceberg-parquet` and 
`iceberg-orc` module structure
   2. **Optional dependency** — Registered via reflection in `InternalData`, no 
impact on existing functionality
   3. **Arrow-native** — Uses Arrow as intermediate representation for 
zero-copy integration
   4. **Complete Metrics** — Full support for rowCount, columnSizes, 
valueCounts, nullCounts, lowerBounds, upperBounds
   5. **Column projection** — `LanceIterable` supports projection reads with 
fieldId matching
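
Principle 5 can be sketched as follows: projection resolves each requested column by field ID rather than by name, so a renamed column still matches. This is only an illustration under that assumption; `ProjectionSketch` and `resolve` are hypothetical and do not reflect how `LanceIterable` is actually written.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical field-ID-based projection resolver.
final class ProjectionSketch {
  // For each projected field ID, returns its column position in the file
  // schema, or -1 when the file does not contain that field.
  static int[] resolve(int[] fileFieldIds, int[] projectedFieldIds) {
    Map<Integer, Integer> positionById = new HashMap<>();
    for (int i = 0; i < fileFieldIds.length; i++) {
      positionById.put(fileFieldIds[i], i);
    }
    int[] positions = new int[projectedFieldIds.length];
    for (int i = 0; i < projectedFieldIds.length; i++) {
      positions[i] = positionById.getOrDefault(projectedFieldIds[i], -1);
    }
    return positions;
  }

  public static void main(String[] args) {
    int[] positions = resolve(new int[] {1, 2, 3}, new int[] {3, 1, 7});
    System.out.println(Arrays.toString(positions)); // prints [2, 0, -1]
  }
}
```

A reader would typically fill positions marked -1 with nulls, which is how a column added after the file was written can still be projected.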
   
   ### Implementation Roadmap
   
   - **Phase 1** ✅ Core format module (this PR)
   - **Phase 2** 🔜 Data layer integration (BaseFileWriterFactory, 
GenericAppenderFactory)
   - **Phase 3** 🔜 Engine integration (Spark/Flink readers and writers)
   - **Phase 4** 🔜 Advanced features (ANN vector search, row-level 
Merge-on-Read)
   
   ### CI Checks
   
   All checks pass locally:
   - ✅ Spotless (Google Java Format)
   - ✅ Checkstyle
   - ✅ Error-Prone static analysis
   - ✅ Javadoc compilation
   - ✅ Apache License headers
   - ✅ Unit tests (60/60 passed)
   - ✅ Full build
   
   > 📄 Full architecture design document: see `iceberg-lance-format-design.md` 
in this PR.
   



