Re: [PR] docs: RFC-102 - Spark Vector Search in Apache Hudi [hudi]

via GitHub Wed, 04 Mar 2026 14:07:51 -0800


vinothchandar commented on code in PR #14218:
URL: https://github.com/apache/hudi/pull/14218#discussion_r2886389603



##########
rfc/rfc-102/rfc-102.md:
##########
@@ -0,0 +1,231 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# RFC-102: Vector Search in Apache Hudi
+
+## Proposers
+
+- @rahil-c
+
+## Approvers
+
+- @suryaprasanna
+- @vinothchandar
+- @balaji-varadarajan-ai
+- @yihua
+
+## Abstract
+
+This RFC proposes adding the ability to perform a native **vector similarity 
search** on Hudi tables.
+
+Building on RFC-100 (unstructured data storage in Hudi), Hudi tables would 
contain unstructured content (e.g., images, video, documents) as well as the 
related *embeddings* for those contents.
+
+The next natural requirement for AI/ML workloads on Hudi is to **search these 
embeddings efficiently**:
+
+- Given a query embedding (e.g., derived from an image, text, or audio 
snippet),
+- Find the *k* most similar rows in a Hudi table,
+- Return those rows along with a similarity score.
+
+This RFC focuses on:
+
+1. Defining a **user-facing API** (via Hudi Spark SQL integration) for vector 
search on Hudi tables.
+2. Describing a **baseline implementation** for how we can achieve this in 
Spark–Hudi integration.
+
+This RFC also discusses, at a high level, **future enhancements** around a 
vector index to speed up this search. The details are deferred to a future 
RFC on index creation and management.
+
+## Background
+
+### What is a Vector Embedding?
+
+A **vector embedding** is a fixed-dimensional numeric representation 
(typically an array of `FLOAT`/`DOUBLE`) produced by a model to capture 
semantic properties of an object (text, image, audio, etc.). For example:
+
+- An image encoder maps each image to a `d`-dimensional vector capturing 
visual semantics.
+- A text encoder maps a sentence or document to a `d`-dimensional vector 
capturing its meaning.
+
+In the context of Hudi, payloads and embeddings are stored as follows:
+
+- RFC-100 enables storing unstructured payloads (e.g., 
image/video/audio/document BLOBs).
+- Embedding vectors are typically modeled as `ARRAY<FLOAT>` columns stored 
alongside those payloads, as additional columns in the same Hudi table.
+
+This RFC assumes that embeddings are already generated by an encoder/model, 
and are already stored as numeric vectors in a Hudi table, and focuses on how 
to **search** them given a new embedding.
+
+### What is Vector Search? (k-Nearest Neighbors)
+
+Given the following inputs:
+
+- A table `T` whose rows contain an embedding column `embedding: ARRAY<FLOAT>`,
+- A query embedding `q: ARRAY<FLOAT>`,
+- A distance metric (e.g., L2, cosine, dot product),
+
+**A k-nearest neighbor (k-NN) search** computes a similarity score between `q` 
and each row's embedding, then returns the *k* closest rows.
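
The three metrics named above can be sketched in plain Python. This is a minimal illustration, not Hudi code: the function names are hypothetical, cosine distance is assumed to mean `1 - cosine similarity`, and the dot product is negated so that, for all three, a smaller value means "closer". The RFC does not fix these conventions.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (assumed convention); undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def negative_dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Negated dot product, so that a larger dot product ranks as closer."""
    return -sum(x * y for x, y in zip(a, b))
```

With "smaller is closer" holding for every metric, a top-k search can sort ascending regardless of which metric the user picks.
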
+
+### Visual Example
+
+Consider a Hudi table that contains an image column and an image embedding 
column. For illustration, we'll use 3-dimensional embeddings (real systems 
typically use much larger dimensions):
+
+| id | image      | image_embedding    | label |
+|----|------------|--------------------|-------|
+| 1  | `cat1.jpg` | [0.10, 0.20, 0.30] | cat   |
+| 2  | `cat2.jpg` | [0.11, 0.18, 0.29] | cat   |
+| 3  | `dog1.jpg` | [0.90, 0.80, 0.95] | dog   |
+
+Now suppose the user has a new image of a cat (`cat3.jpg`) and uses an 
external encoder/model to generate its embedding:
+
+- Query embedding for `cat3.jpg`: `q = [0.09, 0.19, 0.31]`
+
+A typical query flow for this example would be:
+
+1. User generates the query embedding `q` for a new image using an 
encoder/model.
+2. User calls a vector search function on the table and passes `q`.
+3. The function reads all rows (or a filtered subset) from the Hudi table.
+4. For each row, it extracts `image_embedding` and computes the distance to 
`q` (e.g., L2 Euclidean distance):
+
+   - `dist(q, cat1.jpg) ≈ 0.02`
+   - `dist(q, cat2.jpg) ≈ 0.03`
+   - `dist(q, dog1.jpg) ≈ 1.20`
+
+5. It keeps the *k* rows with the lowest distance (highest similarity). For 
example, for `k = 2` the results might be:
+
+   | id | image      | label | `_distance` |
+   |----|------------|-------|-------------|
+   | 1  | `cat1.jpg` | cat   | 0.02        |
+   | 2  | `cat2.jpg` | cat   | 0.03        |
+
+6. The function returns these rows plus the `_distance` column.
+7. The user materializes or joins these top-k similar images for downstream 
use (e.g., showing “similar images” in an application).
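
Steps 3 to 6 of this flow can be reproduced with a brute-force scan over the three example rows. The sketch below is plain Python rather than Spark, and `vector_search` here is a hypothetical helper for illustration, not the proposed Hudi function. It assumes L2 distance and an in-memory list of rows.

```python
import heapq
import math

# The three example rows from the table above.
rows = [
    {"id": 1, "image": "cat1.jpg", "image_embedding": [0.10, 0.20, 0.30], "label": "cat"},
    {"id": 2, "image": "cat2.jpg", "image_embedding": [0.11, 0.18, 0.29], "label": "cat"},
    {"id": 3, "image": "dog1.jpg", "image_embedding": [0.90, 0.80, 0.95], "label": "dog"},
]


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vector_search(rows, embedding_col, query_vector, k):
    """Score every row against the query and keep the k closest (brute force)."""
    scored = (
        {**row, "_distance": l2_distance(row[embedding_col], query_vector)}
        for row in rows
    )
    # nsmallest returns the k rows with the lowest distance, in ascending order.
    return heapq.nsmallest(k, scored, key=lambda r: r["_distance"])


q = [0.09, 0.19, 0.31]  # query embedding for cat3.jpg
top2 = vector_search(rows, "image_embedding", q, k=2)
for r in top2:
    print(r["id"], r["image"], r["label"], round(r["_distance"], 2))
# 1 cat1.jpg cat 0.02
# 2 cat2.jpg cat 0.03
```

Every row is scored on each query, so the cost grows linearly with table size. That is the motivation for the vector index discussed under future enhancements.
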
+
+### User Experience
+
+Requirements:
+
+- Users should be able to perform this operation via DataFrames (used heavily 
in AI/ML) and SQL (used primarily for analytics).
+- Expose clear, minimal parameters (table name or path, embedding column in 
the table, query embedding, distance metric, number of top-k results).
+- Return a table/DataFrame with an extra `_distance` (or `_score`) column that 
can be leveraged by the user for further data manipulation, such as filtering.
+
+### Proposed Interface
+
+We propose a Spark relation that can be used for both Spark DataFrames and 
Spark SQL.
+
+```text
+┌─────────────────────────────────────────────────────────────┐
+│                     User-Facing APIs                       │
+├──────────────────────────┬──────────────────────────────────┤
+│   DataFrame Extension    │   SQL Table-Valued Function      │
+│                          │                                  │
+│  df.vector_search(...)   │  SELECT * FROM                   │
+│                          │  hudi_vector_search(...)         │
+└──────────────┬───────────┴────────────┬─────────────────────┘
+               │                        │
+               ▼                        ▼
+        ┌────────────────────────────────────────┐
+        │   VectorSearchRelation (Scala)         │
+        │   - Core k-NN search logic             │
+        │   - Filter pushdown                    │
+        │   - Distance calculations              │
+        │   - File format integration            │
+        └────────────────┬───────────────────────┘
+                         │
+                         ▼
+                ┌────────────────────┐
+                │   Hudi Table       │
+                │   (with vectors)   │
+                └────────────────────┘
+
+
+
+```
+
+### Query Example
+
+We want to follow semantics similar to what other modern data systems offer 
for vector search.
+See the following reference material:
+* 
https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search
+* https://docs.snowflake.com/en/user-guide/snowflake-cortex/vector-embeddings
+
+An initial version of the Hudi vector search interface could look like this:
+```
+SELECT *
+FROM hudi_vector_search(
+  table_path      => 's3://bucket/path/to/hudi/table',  -- or a table name, e.g. 'table'
+  index_name      => 'my_index',                        -- optional
+  embedding_col   => 'image_embedding',
+  query_vector    => ARRAY(0.12F, -0.03F, 0.81F, ...),
+  k               => 10,
+  distance_metric => 'cosine'
+);
+```
+Users can then chain other SQL operations on top of the result, such as 
filters and joins.
+
+```
+// Vector search with WHERE clause filtering
+val result = spark.sql(
+  s"""
+     |SELECT id, name, price, category, _distance
+     |FROM hudi_vector_search(
+     |  '$tablePath',
+     |  'embedding',
+     |  ARRAY(1.0, 2.0, 3.0),
+     |  10
+     |)
+     |WHERE category = 'electronics' AND price < 100
+     |ORDER BY _distance
+     |""".stripMargin
+).collect()
+```
+
+```
+// Vector search with JOIN
+val result = spark.sql(
+  s"""
+     |SELECT vs.id, vs.name, c.category_name, vs._distance
+     |FROM hudi_vector_search(
+     |  '$productsPath',
+     |  'embedding',
+     |  ARRAY(1.5, 2.5),
+     |  3
+     |) vs
+     |JOIN $categoriesTable c ON vs.category_id = c.category_id
+     |ORDER BY vs._distance
+     |""".stripMargin
+).collect()
+```
+
+For an initial POC of this RFC, see the following commit on a personal 
branch: 
https://github.com/rahil-c/hudi/commit/8a7e1343a347e8061e0e566563c847cfd4ea2f2c#diff-16b383ee6fd99ebc491f3e15185c585e4a7c0796fc3461b8439d28cfee046468
+
+## Future Enhancements
+
+The next goal would be to integrate vector search into Hudi's indexing and 
metadata capabilities, similar to how Hudi already uses the metadata table and 
secondary indexes (e.g., Bloom, column stats) to accelerate queries.
+This RFC intentionally defers detailed index design to a dedicated 
vector-index RFC, but discusses some ideas at a high level.
+
+#### Vector Index Algorithms

Review Comment:
   Please clearly mark these as Appendix



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
