hudi-agent commented on code in PR #18729:
URL: https://github.com/apache/hudi/pull/18729#discussion_r3237725438
##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md:
##########
@@ -130,6 +130,29 @@ BASE_FILE_FORMAT = "parquet" # "parquet" or "lance"
N_SAMPLES = 256
```
+### `04_vector_search_batch.ipynb` — supplemental: batch TVF certification
+
+Exercises **`hudi_vector_search_batch`** (RFC-102) — the table-to-table form
+of vector search. Builds two Hudi tables (corpus + queries) and asserts the
+TVF's top-K per query matches a **numpy ground-truth oracle** that recomputes
+the cosine distance matrix from the same embeddings. The notebook prints
+`CERTIFIED ✓` on success or fails the cell loudly on the first divergence.
+
+Toggle variables:
+
+```python
+BASE_FILE_FORMAT = "parquet" # "parquet" or "lance"
+N_CORPUS = 1000 # corpus row count
+N_QUERIES = 20 # query table row count
+TOP_K = 5
+EMBEDDING_MODEL = "mobilenetv3_small_100"
+```
+
+A `1000 × 20 × k=5` run produces a 20,000-row cross-join intermediate inside
+`BruteForceSearchAlgorithm.buildBatchQueryPlan`, large enough to exercise
+the broadcast + window-rank machinery while still completing in under a
+minute on a 4 GB driver heap.
Review Comment:
🤖 The phrasing "a 20,000-row cross-join intermediate inside
`BruteForceSearchAlgorithm.buildBatchQueryPlan`" exposes an internal
implementation detail. Is this stable enough to document, or could the
class/method name shift before RFC-102 ships? Linking the RFC or making this
slightly more abstract (e.g. "a 20,000-row cross-join intermediate during the
brute-force batch plan") would insulate the docs from future refactors.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
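
For reference, the arithmetic behind the "20,000-row intermediate" can be sketched in numpy without naming any internal class. This is a minimal illustration, not the TVF's actual plan: the embedding dimension `DIM`, the random data, and the helper names `cosine_distance_matrix` and `top_k_per_query` are assumptions made for the example.

```python
import numpy as np


def cosine_distance_matrix(queries, corpus):
    """Return a (n_queries, n_corpus) matrix of cosine distances (1 - cosine similarity)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return 1.0 - q @ c.T


def top_k_per_query(dist, k):
    """Rank every corpus row per query and keep the k smallest distances."""
    order = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return order, np.take_along_axis(dist, order, axis=1)


# Sizes taken from the README defaults; DIM is a hypothetical embedding width.
N_CORPUS, N_QUERIES, TOP_K, DIM = 1000, 20, 5, 16

rng = np.random.default_rng(0)
corpus = rng.normal(size=(N_CORPUS, DIM))
queries = rng.normal(size=(N_QUERIES, DIM))

# One distance per (query, corpus) pair: the cross-join intermediate.
dist = cosine_distance_matrix(queries, corpus)
ids, scores = top_k_per_query(dist, TOP_K)

print(dist.size)   # number of (query, corpus) pairs
print(ids.shape)   # one row of TOP_K corpus indices per query
```

The pair count depends only on `N_CORPUS × N_QUERIES`; `TOP_K` affects the size of the final result, not the intermediate.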
##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/notebooks/README.md:
##########
@@ -130,6 +130,29 @@ BASE_FILE_FORMAT = "parquet" # "parquet" or "lance"
N_SAMPLES = 256
```
+### `04_vector_search_batch.ipynb` — supplemental: batch TVF certification
+
+Exercises **`hudi_vector_search_batch`** (RFC-102) — the table-to-table form
+of vector search. Builds two Hudi tables (corpus + queries) and asserts the
+TVF's top-K per query matches a **numpy ground-truth oracle** that recomputes
+the cosine distance matrix from the same embeddings. The notebook prints
+`CERTIFIED ✓` on success or fails the cell loudly on the first divergence.
+
+Toggle variables:
+
+```python
+BASE_FILE_FORMAT = "parquet" # "parquet" or "lance"
+N_CORPUS = 1000 # corpus row count
+N_QUERIES = 20 # query table row count
+TOP_K = 5
+EMBEDDING_MODEL = "mobilenetv3_small_100"
+```
+
+A `1000 × 20 × k=5` run produces a 20,000-row cross-join intermediate inside
+`BruteForceSearchAlgorithm.buildBatchQueryPlan`, large enough to exercise
Review Comment:
🤖 Minor: the cosine-distance oracle tolerance (`1e-5`, per the PR
description) and the comparison semantics (top-K identity vs.
score-within-tolerance) aren't mentioned here. A one-line note on what
"matches" means in the oracle would help notebook readers understand exactly
what `CERTIFIED ✓` is asserting — e.g., is it the set of top-K ids per query,
the ranked order, the absolute distance values, or all three?
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
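
To make the three candidate meanings of "matches" concrete, here is a hedged sketch of how they differ. The function `compare_top_k`, its argument layout, and the sample values are hypothetical; only the `1e-5` tolerance comes from the comment above, and the notebook's real assertion may check something else.

```python
import numpy as np


def compare_top_k(tvf_ids, tvf_dist, oracle_ids, oracle_dist, tol=1e-5):
    """Compare TVF output with an oracle under three different notions of 'match'.

    All inputs are (n_queries, k) arrays; rows are ordered best match first.
    """
    tvf_ids = np.asarray(tvf_ids)
    oracle_ids = np.asarray(oracle_ids)
    tvf_dist = np.asarray(tvf_dist, dtype=float)
    oracle_dist = np.asarray(oracle_dist, dtype=float)

    # 1. Same set of ids per query, ignoring rank.
    same_id_set = all(
        set(a) == set(b) for a, b in zip(tvf_ids.tolist(), oracle_ids.tolist())
    )
    # 2. Same ids in the same ranked order.
    same_ranked_order = bool(np.array_equal(tvf_ids, oracle_ids))
    # 3. Distances agree position by position within an absolute tolerance.
    distances_within_tol = bool(
        np.allclose(tvf_dist, oracle_dist, rtol=0.0, atol=tol)
    )
    return {
        "same_id_set": same_id_set,
        "same_ranked_order": same_ranked_order,
        "distances_within_tol": distances_within_tol,
    }


# Hypothetical single-query case: two neighbours tie on distance, so the
# two sides may legitimately return them in a different order.
result = compare_top_k(
    tvf_ids=[[7, 3, 1]],
    tvf_dist=[[0.10, 0.10, 0.20]],
    oracle_ids=[[3, 7, 1]],
    oracle_dist=[[0.10, 0.10, 0.20]],
)
print(result)
```

In the tie case the id set and the distances agree while the ranked order does not, which is why the README would benefit from stating which of the three checks `CERTIFIED ✓` relies on.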
##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md:
##########
@@ -25,18 +25,20 @@ the Oxford-IIIT Pet dataset:
2. **BLOB type (INLINE)** — image bytes are written as a Hudi BLOB struct
tagged with `hudi_type = "BLOB"`.
3. **Vector search** — cosine similarity top-K via the
- `hudi_vector_search` SQL table-valued function, backed by Lance files.
+ `hudi_vector_search` and `hudi_vector_search_batch` SQL table-valued
+ functions, backed by Lance files.
-## Three variants
+## Four variants
-The folder ships three scripts — each focused on a specific Hudi feature.
+The folder ships four scripts — each focused on a specific Hudi feature.
Run them independently or in sequence for a full walkthrough.
| File | Feature focus | Surface | Best for |
|---|---|---|---|
| [`hudi_blob_reader_demo.py`](hudi_blob_reader_demo.py) | **OUT_OF_LINE BLOBs
+ `read_blob()`** — Hudi table stores references to bytes living in a separate
container file; `read_blob()` resolves them on demand | Spark SQL | Showing the
"lakehouse that references unstructured data without copying" story — tiny Hudi
table, bytes elsewhere |
-| [`hudi_sql_vector_blob_demo.py`](hudi_sql_vector_blob_demo.py) | **INLINE
BLOBs + VECTOR + `hudi_vector_search`** — bytes embedded in the Hudi base
files, cosine similarity search via the TVF | Spark SQL — `CREATE TABLE ...
(embedding VECTOR(N), image_bytes BLOB, ...) USING hudi`,
`named_struct('type','INLINE', ...)`, `hudi_vector_search(...)` | Live demos;
SQL-first users; showing the Hudi 1.2.0 DDL/DML surface the way it's documented
|
+| [`hudi_sql_vector_blob_demo.py`](hudi_sql_vector_blob_demo.py) | **INLINE
BLOBs + VECTOR + `hudi_vector_search`** — bytes embedded in the Hudi base
files, single-query cosine similarity search via the TVF | Spark SQL — `CREATE
TABLE ... (embedding VECTOR(N), image_bytes BLOB, ...) USING hudi`,
`named_struct('type','INLINE', ...)`, `hudi_vector_search(...)` | Live demos;
SQL-first users; showing the Hudi 1.2.0 DDL/DML surface the way it's documented
|
| [`hudi_dataframe_vector_blob_demo.py`](hudi_dataframe_vector_blob_demo.py) |
Same as the SQL demo, but via DataFrame | Python DataFrame API —
`spark.createDataFrame(rows, explicit_schema)` with `containsNull=False` and
`hudi_type` metadata declared upfront, then
`df.write.format("hudi").save(path)` | Library-style integration; seeing how
the Python DataFrame API composes the VECTOR/BLOB logical types under the hood |
+| [`hudi_vector_search_batch_demo.py`](hudi_vector_search_batch_demo.py) |
**`hudi_vector_search_batch` certification** — table-to-table batch KNN
(RFC-102), 1000-row corpus × 20-row query table, with a **numpy ground-truth
oracle** that fails the run if the TVF disagrees | Spark SQL —
`hudi_vector_search_batch('<corpus>', 'embedding', '<queries>', 'embedding', k,
'cosine')` | Certifying batch-mode correctness at non-trivial scale; broadcast
+ window-rank machinery under load |
All three share the same venv, jars, and env vars. They write to different
Review Comment:
🤖 Now that a fourth variant has been added, this sentence still reads "All
three share the same venv, jars, and env vars." Consider updating to "All four
share ..." to stay consistent with the new "Four variants" heading and table
above.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md:
##########
@@ -160,6 +163,24 @@ export HUDI_LANCE_DEMO_N=1000
python hudi_dataframe_vector_blob_demo.py # or hudi_sql_vector_blob_demo.py
```
+### Run the batch vector search certification
+
+```bash
+# Defaults: 1000-row corpus × 20-row queries × top-k=5, against the corpus.
+HUDI_BASE_FILE_FORMAT=parquet python hudi_vector_search_batch_demo.py
+```
+
+The run ends with a numpy ground-truth oracle that compares the TVF's
+top-K per query against a locally computed cosine distance matrix. A
+successful run ends with `CERTIFIED ✓` and writes
Review Comment:
🤖 The text says a successful run "writes
`outputs/hudi_vector_search_batch_<format>_results.png` (one row per query,
showing its top-K matches)." Could you confirm the output filename pattern and
directory match what the script actually produces (including how it interacts
with `HUDI_LANCE_DEMO_OUTDIR`, which is documented as the configurable output
directory for query/top-K PNGs)? A 20-row × top-K=5 panel can be large — a
brief note about expected dimensions or how it scales with
`HUDI_BATCH_N_QUERIES`/`HUDI_BATCH_TOP_K` would help users know what to expect.
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/README.md:
##########
@@ -160,6 +163,24 @@ export HUDI_LANCE_DEMO_N=1000
python hudi_dataframe_vector_blob_demo.py # or hudi_sql_vector_blob_demo.py
```
+### Run the batch vector search certification
+
+```bash
+# Defaults: 1000-row corpus × 20-row queries × top-k=5, against the corpus.
Review Comment:
🤖 It might help to mention here (or near the knobs list) that the queries
table also lives at a separate Hudi path — the existing prose a few lines up
only enumerates `/tmp/hudi_blob_reader_*` and `/tmp/hudi_sql_*` paths. New
users running all four demos in sequence will want to know where the batch
demo's corpus and query tables land (and whether they can be cleaned up safely).
<sub><i>- AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.