tardunge opened a new pull request, #15751:
URL: https://github.com/apache/iceberg/pull/15751
This is a **draft/POC**, not a merge request. Opening for visibility and discussion.
Relates to #12225 (File Format API) and #13438 (Lance integration discussion).
## Summary
Working proof-of-concept implementing Lance as an Iceberg file format using the
File Format API (#12774). Lance is a columnar format for AI/ML workloads with a
reported 100x faster random access than Parquet, native vector search, and an
Arrow-native data model.
This is the first new format added through the pluggable File Format API.
## What works
- Full read/write round-trip via `FormatModel` / `ReadBuilder` / `WriteBuilder`
- Spark 4.1 SQL: `CREATE TABLE`, `INSERT INTO`, `SELECT`, column projection, `WHERE` filtering
- Both vectorized (`ColumnarBatch` via `ArrowColumnVector`) and row-based (`InternalRow`) read paths
- 19 unit tests (round-trip, schema preservation, nulls, projection, batch sizing, file length)
- Iceberg schema stored in Lance file metadata via Arrow field ID preservation
### Spark SQL demo
```sql
CREATE TABLE t (id INT, name STRING, score DOUBLE)
USING iceberg TBLPROPERTIES ('write.format.default' = 'lance');
INSERT INTO t VALUES (1, 'alice', 95.5), (2, 'bob', 87.3), (3, 'charlie', 92.1);
SELECT * FROM t WHERE score > 90;
-- 1 alice 95.5
-- 3 charlie 92.1
```
## Changes
### New module: `lance/`
| File | Purpose |
|---|---|
| `LanceFormatModel` | `FormatModel` impl with `WriteBuilderWrapper` / `ReadBuilderWrapper` |
| `LanceFileAppender` | Bridges record-at-a-time `add(D)` to batch `write(VectorSchemaRoot)` |
| `LanceSchemaUtil` | Bidirectional Iceberg-Arrow schema conversion with field ID preservation |
| `LanceArrowConverter` | Row-level Record-Arrow vector type conversion |
| `GenericLanceReader/Writer` | `ReaderFunction` / `WriterFunction` for generic `Record` |
| `LanceFormatModels` | Registration entry point (auto-discovered by `FormatModelRegistry`) |
| `spark/SparkLanceFormatModels` | Registers Lance for `ColumnarBatch` + `InternalRow` |
| `spark/SparkLanceColumnarReader` | Zero-copy Arrow to `ColumnarBatch` via `ArrowColumnVector` |
| `spark/SparkLanceRowReader` | Arrow to `GenericInternalRow` row-by-row conversion |
| `spark/SparkLanceWriter` | `InternalRow` to Arrow for Lance writes |
| `LANCE_SDK_GAPS.md` | Documents 5 gaps in the Lance Java SDK |
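
The record-to-batch bridging described for `LanceFileAppender` can be illustrated with a minimal, dependency-free sketch. This is not the PR's implementation: `BatchingAppenderSketch` is a hypothetical name, and a plain `Consumer<List<D>>` stands in for the batch-oriented `write(VectorSchemaRoot)` call so the example runs without Arrow or Lance on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: buffers single records and hands them off in fixed-size batches. */
final class BatchingAppenderSketch<D> implements AutoCloseable {
  private final int batchSize;
  // Assumption: stands in for a batch writer such as write(VectorSchemaRoot).
  private final Consumer<List<D>> batchSink;
  private final List<D> buffer;
  private long recordCount = 0;

  BatchingAppenderSketch(int batchSize, Consumer<List<D>> batchSink) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    this.batchSize = batchSize;
    this.batchSink = batchSink;
    this.buffer = new ArrayList<>(batchSize);
  }

  /** Record-at-a-time entry point; flushes once the buffer reaches batchSize. */
  void add(D record) {
    buffer.add(record);
    recordCount++;
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  /** Hands the buffered records to the sink as one batch, then resets the buffer. */
  private void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    // Copy so the sink may keep the batch after the buffer is cleared.
    batchSink.accept(new ArrayList<>(buffer));
    buffer.clear();
  }

  long recordCount() {
    return recordCount;
  }

  /** Flushes the final, possibly partial, batch. */
  @Override
  public void close() {
    flush();
  }
}
```

With a batch size of 2, adding three records emits one full batch immediately and a final batch of one record on `close()`. A real appender would also have to release the native resources it holds, which this sketch leaves out.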
### Modified files
| File | Change |
|---|---|
| `FileFormat.java` | Add `LANCE("lance", false)` |
| `FormatModelRegistry.java` | Add `LanceFormatModels` to `CLASSES_TO_REGISTER` |
| `settings.gradle` | Register `lance` module |
| `build.gradle` | Add `project(':iceberg-lance')` block |
| `spark/v4.1/build.gradle` | Exclude Lance from shadow jar |
## Architecture decisions
**Arrow relocation:** The Spark runtime shadow jar relocates `org.apache.arrow`.
Lance uses Arrow via JNI (Rust native code with hardcoded Java class names), so
Lance code cannot be relocated. All Lance code (including Spark readers/writers)
lives in the `lance/` module outside the shadow jar. The `FormatModel` interface
boundary has zero Arrow imports, so relocated and unrelocated code never exchange
Arrow objects.
**Runtime classpath:**
```
--jars iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,jar-jni.jar,arrow-c-data.jar
```
**Spark registration:** `LanceFormatModels.register()` conditionally registers
Spark format models via `Class.forName("...InternalRow")`; registration is
skipped when Spark is not on the classpath.
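
The classpath-probing pattern behind conditional registration can be sketched roughly as follows. This illustrates the general technique, not the code in this PR: `ConditionalRegistrationSketch`, `isPresent`, and the string labels are hypothetical, and the marker class is passed in as a parameter rather than hardcoded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of optional registration guarded by a classpath probe. */
final class ConditionalRegistrationSketch {
  private ConditionalRegistrationSketch() {}

  /** Returns true if the named class can be located without initializing it. */
  static boolean isPresent(String className) {
    try {
      Class.forName(className, false, ConditionalRegistrationSketch.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // The optional dependency is missing or unusable; treat it as absent.
      return false;
    }
  }

  /**
   * Always registers the generic model; registers the engine-specific model
   * only when the engine's marker class is on the classpath.
   */
  static List<String> register(String engineMarkerClass) {
    List<String> registered = new ArrayList<>();
    registered.add("generic");
    if (isPresent(engineMarkerClass)) {
      registered.add("engine");
    }
    return registered;
  }
}
```

Calling `register("java.lang.String")` registers both entries because the marker class exists, while `register("no.such.Marker")` registers only the generic entry. Passing `false` for the `initialize` argument avoids running static initializers during the probe.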
## Known gaps
| Gap | Impact | Owner |
|---|---|---|
| File length (`getBytesWritten`) | Uses `OutputFile.toInputFile().getLength()` workaround | Lance JNI (1-line fix) |
| Column statistics | No file-level pruning | Lance PR lancedb/lance#5639 + JNI |
| Split planning | One task per file | iceberg-lance + Lance JNI |
| Predicate pushdown | No-op (residual filter preserves correctness) | iceberg-lance |
| Name mapping | Column rename not supported | iceberg-lance |
Details in `lance/LANCE_SDK_GAPS.md`.
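
To illustrate why a no-op predicate pushdown is still correct, here is a small self-contained sketch. It assumes the engine re-applies the full predicate as a residual filter over whatever the reader returns; `ResidualFilterSketch` and its methods are hypothetical names, and the rows mirror the Spark SQL demo above.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/** Hypothetical sketch: a reader that ignores pushed filters, plus an engine-side residual. */
final class ResidualFilterSketch {
  private ResidualFilterSketch() {}

  /** Minimal row type matching the demo table (id, name, score). */
  static final class Row {
    final int id;
    final String name;
    final double score;

    Row(int id, String name, double score) {
      this.id = id;
      this.name = name;
      this.score = score;
    }
  }

  /** No-op pushdown: the filter is accepted but every row is returned. */
  static List<Row> readIgnoringFilter(List<Row> fileContents, Predicate<Row> pushedFilter) {
    return fileContents; // pushedFilter is intentionally unused
  }

  /** Engine-side scan: re-applies the predicate as a residual after the read. */
  static List<Row> scan(List<Row> fileContents, Predicate<Row> filter) {
    return readIgnoringFilter(fileContents, filter).stream()
        .filter(filter)
        .collect(Collectors.toList());
  }
}
```

Scanning the three demo rows with `score > 90` still yields only ids 1 and 3. The cost of the no-op is performance, not correctness: every row is read and decoded before the residual discards the non-matching ones.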
## References
- [Lance](https://lance.org/)
- [File Format API proposal](https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds)
- [File Format API issue #12225](https://github.com/apache/iceberg/issues/12225)
- [Lance integration discussion #13438](https://github.com/apache/iceberg/issues/13438)
- [File Format API PR #12774](https://github.com/apache/iceberg/pull/12774)
- [Lance Java SDK](https://github.com/lancedb/lance/tree/main/java)
cc @pvary @westonpace