Re: [PR] docs: add batch vector search demo + notebook for RFC-102 [hudi]

via GitHub Wed, 13 May 2026 15:06:25 -0700


hudi-agent commented on code in PR #18729:
URL: https://github.com/apache/hudi/pull/18729#discussion_r3237725438



##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md:
##########
@@ -130,6 +130,29 @@ BASE_FILE_FORMAT  = "parquet"      # "parquet" or "lance"
 N_SAMPLES         = 256
 ```
 
+### `04_vector_search_batch.ipynb` — supplemental: batch TVF certification
+
+Exercises **`hudi_vector_search_batch`** (RFC-102) — the table-to-table form
+of vector search. Builds two Hudi tables (corpus + queries) and asserts the
+TVF's top-K per query matches a **numpy ground-truth oracle** that recomputes
+the cosine distance matrix from the same embeddings. The notebook prints
+`CERTIFIED ✓` on success or fails the cell loudly on the first divergence.
+
+Toggle variables:
+
+```python
+BASE_FILE_FORMAT = "parquet"   # "parquet" or "lance"
+N_CORPUS         = 1000         # corpus row count
+N_QUERIES        = 20           # query table row count
+TOP_K            = 5
+EMBEDDING_MODEL  = "mobilenetv3_small_100"
+```
+
+A `1000 × 20 × k=5` run produces a 20,000-row cross-join intermediate inside
+`BruteForceSearchAlgorithm.buildBatchQueryPlan`, large enough to exercise
+the broadcast + window-rank machinery while still completing in under a
+minute on a 4 GB driver heap.

Review Comment:
   🤖 The phrasing "a 20,000-row cross-join intermediate inside 
`BruteForceSearchAlgorithm.buildBatchQueryPlan`" exposes an internal 
implementation detail. Is this stable enough to document, or could the 
class/method name shift before RFC-102 ships? Linking the RFC or making this 
slightly more abstract (e.g. "a 20,000-row cross-join intermediate during the 
brute-force batch plan") would insulate the docs from future refactors.
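   
   As a rough illustration of where the 20,000 figure comes from (not taken from the PR): the helper name `batch_plan_row_counts` is hypothetical, and the two-stage shape (cross join, then per-query top-K) is only what the README wording implies, not a confirmed description of the implementation.

```python
def batch_plan_row_counts(n_corpus: int, n_queries: int, top_k: int) -> tuple[int, int]:
    """Return (cross-join rows, rows left after per-query top-K)."""
    # Every query row is paired with every corpus row.
    cross_join_rows = n_corpus * n_queries
    # A query cannot return more neighbours than the corpus holds.
    result_rows = n_queries * min(top_k, n_corpus)
    return cross_join_rows, result_rows


# README defaults: N_CORPUS=1000, N_QUERIES=20, TOP_K=5.
print(batch_plan_row_counts(1000, 20, 5))  # (20000, 100)
```

   Stating the size this way in the README would survive a rename of the class or method.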
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md:
##########
@@ -130,6 +130,29 @@ BASE_FILE_FORMAT  = "parquet"      # "parquet" or "lance"
 N_SAMPLES         = 256
 ```
 
+### `04_vector_search_batch.ipynb` — supplemental: batch TVF certification
+
+Exercises **`hudi_vector_search_batch`** (RFC-102) — the table-to-table form
+of vector search. Builds two Hudi tables (corpus + queries) and asserts the
+TVF's top-K per query matches a **numpy ground-truth oracle** that recomputes
+the cosine distance matrix from the same embeddings. The notebook prints
+`CERTIFIED ✓` on success or fails the cell loudly on the first divergence.
+
+Toggle variables:
+
+```python
+BASE_FILE_FORMAT = "parquet"   # "parquet" or "lance"
+N_CORPUS         = 1000         # corpus row count
+N_QUERIES        = 20           # query table row count
+TOP_K            = 5
+EMBEDDING_MODEL  = "mobilenetv3_small_100"
+```
+
+A `1000 × 20 × k=5` run produces a 20,000-row cross-join intermediate inside
+`BruteForceSearchAlgorithm.buildBatchQueryPlan`, large enough to exercise

Review Comment:
   🤖 Minor: the cosine-distance oracle tolerance (`1e-5`, per the PR 
description) and the comparison semantics (top-K identity vs. 
score-within-tolerance) aren't mentioned here. A one-line note on what 
"matches" means in the oracle would help notebook readers understand exactly 
what `CERTIFIED ✓` is asserting — e.g., is it the set of top-K ids per query, 
the ranked order, the absolute distance values, or all three?
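   
   To make the three candidate meanings concrete, here is a minimal numpy sketch. It is an assumption about what such an oracle could look like, not the notebook's actual code: the function names are hypothetical, and only the `1e-5` tolerance comes from the PR description.

```python
import numpy as np


def cosine_distance_matrix(queries: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance, shape (n_queries, n_corpus)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return 1.0 - q @ c.T


def oracle_top_k(queries: np.ndarray, corpus: np.ndarray, k: int):
    """Return (ids, distances) of the k nearest corpus rows per query, nearest first."""
    dist = cosine_distance_matrix(queries, corpus)
    ids = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return ids, np.take_along_axis(dist, ids, axis=1)


def compare(tvf_ids, tvf_dist, oracle_ids, oracle_dist, atol=1e-5):
    """Report which of the three candidate notions of 'matches' hold."""
    same_set = all(
        set(a) == set(b) for a, b in zip(tvf_ids.tolist(), oracle_ids.tolist())
    )
    same_order = bool(np.array_equal(tvf_ids, oracle_ids))
    close = bool(np.allclose(tvf_dist, oracle_dist, rtol=0.0, atol=atol))
    return {
        "same_id_set": same_set,
        "same_ranked_order": same_order,
        "distances_within_tol": close,
    }


# Synthetic stand-ins for the embeddings; sizes follow the README defaults.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 16))
queries = rng.normal(size=(20, 16))
ids, dist = oracle_top_k(queries, corpus, k=5)

# Swap the first two neighbours of every query to mimic an ordering difference.
swapped = ids[:, [1, 0, 2, 3, 4]]
report = compare(swapped, dist, ids, dist)
print(report)
```

   In this run the id sets and distances agree while the ranked order does not, which is exactly the kind of case the README should say passes or fails.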
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md:
##########
@@ -25,18 +25,20 @@ the Oxford-IIIT Pet dataset:
 2. **BLOB type (INLINE)** — image bytes are written as a Hudi BLOB struct
    tagged with `hudi_type = "BLOB"`.
 3. **Vector search** — cosine similarity top-K via the
-   `hudi_vector_search` SQL table-valued function, backed by Lance files.
+   `hudi_vector_search` and `hudi_vector_search_batch` SQL table-valued
+   functions, backed by Lance files.
 
-## Three variants
+## Four variants
 
-The folder ships three scripts — each focused on a specific Hudi feature.
+The folder ships four scripts — each focused on a specific Hudi feature.
 Run them independently or in sequence for a full walkthrough.
 
 | File | Feature focus | Surface | Best for |
 |---|---|---|---|
 | [`hudi_blob_reader_demo.py`](hudi_blob_reader_demo.py) | **OUT_OF_LINE BLOBs + `read_blob()`** — Hudi table stores references to bytes living in a separate container file; `read_blob()` resolves them on demand | Spark SQL | Showing the "lakehouse that references unstructured data without copying" story — tiny Hudi table, bytes elsewhere |
-| [`hudi_sql_vector_blob_demo.py`](hudi_sql_vector_blob_demo.py) | **INLINE BLOBs + VECTOR + `hudi_vector_search`** — bytes embedded in the Hudi base files, cosine similarity search via the TVF | Spark SQL — `CREATE TABLE ... (embedding VECTOR(N), image_bytes BLOB, ...) USING hudi`, `named_struct('type','INLINE', ...)`, `hudi_vector_search(...)` | Live demos; SQL-first users; showing the Hudi 1.2.0 DDL/DML surface the way it's documented |
+| [`hudi_sql_vector_blob_demo.py`](hudi_sql_vector_blob_demo.py) | **INLINE BLOBs + VECTOR + `hudi_vector_search`** — bytes embedded in the Hudi base files, single-query cosine similarity search via the TVF | Spark SQL — `CREATE TABLE ... (embedding VECTOR(N), image_bytes BLOB, ...) USING hudi`, `named_struct('type','INLINE', ...)`, `hudi_vector_search(...)` | Live demos; SQL-first users; showing the Hudi 1.2.0 DDL/DML surface the way it's documented |
 | [`hudi_dataframe_vector_blob_demo.py`](hudi_dataframe_vector_blob_demo.py) | Same as the SQL demo, but via DataFrame | Python DataFrame API — `spark.createDataFrame(rows, explicit_schema)` with `containsNull=False` and `hudi_type` metadata declared upfront, then `df.write.format("hudi").save(path)` | Library-style integration; seeing how the Python DataFrame API composes the VECTOR/BLOB logical types under the hood |
+| [`hudi_vector_search_batch_demo.py`](hudi_vector_search_batch_demo.py) | **`hudi_vector_search_batch` certification** — table-to-table batch KNN (RFC-102), 1000-row corpus × 20-row query table, with a **numpy ground-truth oracle** that fails the run if the TVF disagrees | Spark SQL — `hudi_vector_search_batch('<corpus>', 'embedding', '<queries>', 'embedding', k, 'cosine')` | Certifying batch-mode correctness at non-trivial scale; broadcast + window-rank machinery under load |
 
 All three share the same venv, jars, and env vars. They write to different

Review Comment:
   🤖 Now that a fourth variant has been added, this sentence still reads "All 
three share the same venv, jars, and env vars." Consider updating to "All four 
share ..." to stay consistent with the new "Four variants" heading and table 
above.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md:
##########
@@ -160,6 +163,24 @@ export HUDI_LANCE_DEMO_N=1000
 python hudi_dataframe_vector_blob_demo.py        # or hudi_sql_vector_blob_demo.py
 ```
 
+### Run the batch vector search certification
+
+```bash
+# Defaults: 1000-row corpus × 20-row queries × top-k=5, against the corpus.
+HUDI_BASE_FILE_FORMAT=parquet python hudi_vector_search_batch_demo.py
+```
+
+The run ends with a numpy ground-truth oracle that compares the TVF's
+top-K per query against a locally computed cosine distance matrix. A
+successful run ends with `CERTIFIED ✓` and writes

Review Comment:
   🤖 The text says a successful run "writes 
`outputs/hudi_vector_search_batch_<format>_results.png` (one row per query, 
showing its top-K matches)." Could you confirm the output filename pattern and 
directory match what the script actually produces (including how it interacts 
with `HUDI_LANCE_DEMO_OUTDIR`, which is documented as the configurable output 
directory for query/top-K PNGs)? A 20-row × top-K=5 panel can be large — a 
brief note about expected dimensions or how it scales with 
`HUDI_BATCH_N_QUERIES`/`HUDI_BATCH_TOP_K` would help users know what to expect.
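   
   For reference, this is the behaviour I would expect if the documented names compose the obvious way. It is a sketch under assumptions, not a reading of the script: that `HUDI_LANCE_DEMO_OUTDIR` overrides a default `outputs` directory, and that the panel has one row per query with one column for the query image plus one per match. Both helper names are hypothetical.

```python
import os
from pathlib import Path


def expected_results_png(base_file_format: str, env=None) -> Path:
    """Build the results path from the filename pattern quoted in the README."""
    env = os.environ if env is None else env
    # Assumption: HUDI_LANCE_DEMO_OUTDIR overrides the default "outputs" directory.
    outdir = Path(env.get("HUDI_LANCE_DEMO_OUTDIR", "outputs"))
    return outdir / f"hudi_vector_search_batch_{base_file_format}_results.png"


def panel_grid(env=None, default_queries: int = 20, default_top_k: int = 5):
    """Return (rows, columns) of the results panel."""
    env = os.environ if env is None else env
    n_queries = int(env.get("HUDI_BATCH_N_QUERIES", default_queries))
    top_k = int(env.get("HUDI_BATCH_TOP_K", default_top_k))
    # Assumption: one row per query, one column for the query plus one per match.
    return n_queries, 1 + top_k


print(expected_results_png("parquet", env={}))
print(panel_grid(env={}))
```

   If the script does something different, the README should spell out the difference.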
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md:
##########
@@ -160,6 +163,24 @@ export HUDI_LANCE_DEMO_N=1000
 python hudi_dataframe_vector_blob_demo.py        # or hudi_sql_vector_blob_demo.py
 ```
 
+### Run the batch vector search certification
+
+```bash
+# Defaults: 1000-row corpus × 20-row queries × top-k=5, against the corpus.

Review Comment:
   🤖 It might help to mention here (or near the knobs list) that the queries 
table also lives at a separate Hudi path — the existing prose a few lines up 
only enumerates `/tmp/hudi_blob_reader_*` and `/tmp/hudi_sql_*` paths. New 
users running all four demos in sequence will want to know where the batch 
demo's corpus and query tables land (and whether they can be cleaned up safely).
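   
   A dry-run listing along these lines would also let users see what is left behind before deleting anything. This is only a sketch: the two patterns are the ones the README already names, the batch demo's own paths are not stated in this thread and would have to be added once known, and `find_demo_tables` is a hypothetical helper.

```python
import glob
import os

# Prefixes named in the existing README prose. The batch demo's corpus and
# query table paths are not known here, so they are deliberately left out.
KNOWN_PATTERNS = ("hudi_blob_reader_*", "hudi_sql_*")


def find_demo_tables(root: str = "/tmp", patterns=KNOWN_PATTERNS) -> list:
    """Return existing directories under root that match any pattern (no deletion)."""
    found = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(root, pattern)):
            if os.path.isdir(path):
                found.append(path)
    return sorted(found)


for table_path in find_demo_tables():
    print(table_path)
```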
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
