tardunge opened a new pull request, #15751:
URL: https://github.com/apache/iceberg/pull/15751

   This is a **draft/POC** — not requesting merge. Opening for visibility and 
discussion.
   
   Relates to #12225 (File Format API) and #13438 (Lance integration 
discussion).
   
   ## Summary
   
   Working proof of concept implementing Lance as an Iceberg file format using the
   File Format API (#12774). Lance is a columnar format for AI/ML workloads with
   random access that Lance benchmarks at up to 100x faster than Parquet, native
   vector search, and an Arrow-native data model.
   
   This is the first new format added through the pluggable File Format API.
   
   ## What works
   
   - Full read/write round-trip via `FormatModel` / `ReadBuilder` / 
`WriteBuilder`
   - Spark 4.1 SQL: `CREATE TABLE`, `INSERT INTO`, `SELECT`, column projection, 
`WHERE` filtering
   - Both vectorized (`ColumnarBatch` via `ArrowColumnVector`) and row-based 
(`InternalRow`) read paths
   - 19 unit tests (round-trip, schema preservation, nulls, projection, batch 
sizing, file length)
   - Iceberg schema stored in Lance file metadata via Arrow field ID 
preservation
   
   ### Spark SQL demo
   
   ```sql
   CREATE TABLE t (id INT, name STRING, score DOUBLE)
   USING iceberg TBLPROPERTIES ('write.format.default' = 'lance');
   
   INSERT INTO t VALUES (1, 'alice', 95.5), (2, 'bob', 87.3), (3, 'charlie', 
92.1);
   SELECT * FROM t WHERE score > 90;
   -- 1  alice    95.5
   -- 3  charlie  92.1
   ```
   
   ## Changes
   
   ### New module: `lance/`
   
   | File | Purpose |
   |---|---|
   | `LanceFormatModel` | `FormatModel` impl with `WriteBuilderWrapper` / 
`ReadBuilderWrapper` |
   | `LanceFileAppender` | Bridges record-at-a-time `add(D)` to batch 
`write(VectorSchemaRoot)` |
   | `LanceSchemaUtil` | Bidirectional Iceberg-Arrow schema conversion with 
field ID preservation |
   | `LanceArrowConverter` | Row-level conversion between `Record` values and Arrow vectors |
   | `GenericLanceReader/Writer` | `ReaderFunction` / `WriterFunction` for 
generic `Record` |
   | `LanceFormatModels` | Registration entry point (auto-discovered by 
`FormatModelRegistry`) |
   | `spark/SparkLanceFormatModels` | Registers Lance for `ColumnarBatch` + 
`InternalRow` |
   | `spark/SparkLanceColumnarReader` | Zero-copy Arrow to `ColumnarBatch` via 
`ArrowColumnVector` |
   | `spark/SparkLanceRowReader` | Arrow to `GenericInternalRow` row-by-row 
conversion |
   | `spark/SparkLanceWriter` | `InternalRow` to Arrow for Lance writes |
   | `LANCE_SDK_GAPS.md` | Documents 5 gaps in the Lance Java SDK |
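
   The `LanceFileAppender` bridge above follows a simple buffering pattern: accumulate records until a batch-size threshold, then flush them as one batch. A minimal, dependency-free sketch of that pattern (class and method names are illustrative, not the actual implementation; `Consumer<List<D>>` stands in for the batch `write(VectorSchemaRoot)` call):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.Consumer;
   
   // Illustrative sketch: bridge a record-at-a-time add(D) API to a
   // batch-oriented writer by buffering and flushing at a threshold.
   class BatchingAppender<D> {
     private final int batchSize;
     private final Consumer<List<D>> batchWriter; // stand-in for write(VectorSchemaRoot)
     private final List<D> buffer = new ArrayList<>();
   
     BatchingAppender(int batchSize, Consumer<List<D>> batchWriter) {
       this.batchSize = batchSize;
       this.batchWriter = batchWriter;
     }
   
     void add(D record) {
       buffer.add(record);
       if (buffer.size() >= batchSize) {
         flush();
       }
     }
   
     void close() {
       flush(); // write any trailing partial batch
     }
   
     private void flush() {
       if (!buffer.isEmpty()) {
         batchWriter.accept(new ArrayList<>(buffer));
         buffer.clear();
       }
     }
   }
   ```
   
   The real appender additionally converts the buffered records into Arrow vectors before flushing; this sketch shows only the batching contract.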
   
   ### Modified files
   
   | File | Change |
   |---|---|
   | `FileFormat.java` | Add `LANCE("lance", false)` |
   | `FormatModelRegistry.java` | Add `LanceFormatModels` to 
`CLASSES_TO_REGISTER` |
   | `settings.gradle` | Register `lance` module |
   | `build.gradle` | Add `project(':iceberg-lance')` block |
   | `spark/v4.1/build.gradle` | Exclude Lance from shadow jar |
   
   ## Architecture decisions
   
   **Arrow relocation:** The Spark runtime shadow jar relocates 
`org.apache.arrow`.
   Lance uses Arrow via JNI (Rust native code with hardcoded Java class names), 
so
   Lance code cannot be relocated. All Lance code (including Spark 
readers/writers)
   lives in the `lance/` module outside the shadow jar. The `FormatModel` 
interface
   boundary has zero Arrow imports, so relocated and unrelocated code never 
exchange
   Arrow objects.
   
   **Runtime classpath:**
   ```
   --jars 
iceberg-spark-runtime.jar,iceberg-lance.jar,lance-core.jar,jar-jni.jar,arrow-c-data.jar
   ```
   
   **Spark registration:** `LanceFormatModels.register()` conditionally 
registers
   Spark format models via `Class.forName("...InternalRow")` — skipped when 
Spark
   is not on the classpath.
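
   The classpath-conditional guard can be sketched generically. This illustrates the `Class.forName` probe technique only; the helper name and call shape are made up, not the actual registration code:
   
   ```java
   // Probe for a marker class and run the registration step only when it
   // loads; a ClassNotFoundException means the optional dependency (here,
   // Spark) is absent, so registration is skipped silently.
   class ConditionalRegistration {
     static boolean registerIfPresent(String markerClass, Runnable register) {
       try {
         Class.forName(markerClass);
         register.run(); // dependency is on the classpath
         return true;
       } catch (ClassNotFoundException e) {
         return false;   // skip: optional integration not available
       }
     }
   }
   ```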
   
   ## Known gaps
   
   | Gap | Impact | Owner |
   |---|---|---|
   | File length (`getBytesWritten`) | Uses 
`OutputFile.toInputFile().getLength()` workaround | Lance JNI (1-line fix) |
   | Column statistics | No file-level pruning | Lance PR lancedb/lance#5639 + 
JNI |
   | Split planning | One task per file | iceberg-lance + Lance JNI |
   | Predicate pushdown | No-op (residual filter preserves correctness) | 
iceberg-lance |
   | Name mapping | Column rename not supported | iceberg-lance |
   
   Details in `lance/LANCE_SDK_GAPS.md`.
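
   For the first gap, the workaround amounts to asking storage for the finished file's size instead of getting it from the writer. A generic sketch using `java.nio` as a stand-in for Iceberg's `OutputFile.toInputFile().getLength()` (helper name is hypothetical):
   
   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Path;
   
   // Hypothetical sketch of the file-length fallback: when the writer
   // cannot report bytes written, re-read the size from storage after
   // the file is closed.
   class FileLengthFallback {
     static long lengthOf(Path written) throws IOException {
       return Files.size(written); // one extra metadata round-trip per file
     }
   }
   ```
   
   The cost is an extra metadata round-trip per data file, which is why the table above treats the proper fix as a small change in the Lance JNI layer.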
   
   ## References
   
   - [Lance](https://lance.org/)
   - [File Format API 
proposal](https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds)
   - [File Format API issue 
#12225](https://github.com/apache/iceberg/issues/12225)
   - [Lance integration discussion 
#13438](https://github.com/apache/iceberg/issues/13438)
   - [File Format API PR #12774](https://github.com/apache/iceberg/pull/12774)
   - [Lance Java SDK](https://github.com/lancedb/lance/tree/main/java)
   
   cc @pvary @westonpace
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

