vinothchandar commented on code in PR #14218: URL: https://github.com/apache/hudi/pull/14218#discussion_r2886389603
########## rfc/rfc-102/rfc-102.md: ########## @@ -0,0 +1,231 @@ +<!-- +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--> + +# RFC-102: Vector Search in Apache Hudi + +## Proposers + +- @rahil-c + +## Approvers + +- @suryaprasanna +- @vinothchandar +- @balaji-varadarajan-ai +- @yihua + +## Abstract + +This RFC proposes adding the ability to perform a native **vector similarity search** on Hudi tables. + +Building on RFC-100 (unstructured data storage in Hudi), Hudi tables would contain unstructured content (e.g., images, video, documents) as well as the related *embeddings* for those contents. + +The next natural requirement for AI/ML workloads on Hudi is to **search these embeddings efficiently**: + +- Given a query embedding (e.g., derived from an image, text, or audio snippet), +- Find the *k* most similar rows in a Hudi table, +- Return those rows along with a similarity score. + +This RFC focuses on: + +1. Defining a **user-facing API** (via Hudi Spark SQL integration) for vector search on Hudi tables. +2. Describing a **baseline implementation** for how we can achieve this in Spark–Hudi integration. + +Note this RFC will also discuss at a high level **future enhancements** around a vector index to speed up this search. 
These details will likely be fleshed out in a future RFC around index creation and management. + +## Background + +### What is a Vector Embedding? + +A **vector embedding** is a fixed-dimensional numeric representation (typically an array of `FLOAT`/`DOUBLE`) produced by a model to capture semantic properties of an object (text, image, audio, etc.). For example: + +- An image encoder maps each image to a `d`-dimensional vector capturing visual semantics. +- A text encoder maps a sentence or document to a `d`-dimensional vector capturing its meaning. + +In the context of Hudi: + +- RFC-100 enables storing unstructured payloads (e.g., image/video/audio/document BLOBs). +- Embedding vectors are typically modeled as `ARRAY<FLOAT>` columns stored alongside those payloads in the same Hudi table. + +This RFC assumes that embeddings have already been generated by an encoder/model and stored as numeric vectors in a Hudi table; it focuses on how to **search** them given a new query embedding. + +### What is Vector Search? (k-Nearest Neighbors) + +Given the following inputs: + +- A table `T` whose rows contain an embedding column `embedding: ARRAY<FLOAT>`, +- A query embedding `q: ARRAY<FLOAT>`, +- A distance metric (e.g., L2, cosine, dot product), + +**A k-nearest neighbor (k-NN) search** computes the distance (or similarity score) between `q` and each row's embedding under the chosen metric, then returns the *k* closest rows. + +### Visual Example + +Consider a Hudi table that contains an image column and an image embedding column. 
For illustration, we'll use 3-dimensional embeddings (real systems typically use much larger dimensions): + +| id | image | image_embedding | label | +|----|------------|--------------------|-------| +| 1 | `cat1.jpg` | [0.10, 0.20, 0.30] | cat | +| 2 | `cat2.jpg` | [0.11, 0.18, 0.29] | cat | +| 3 | `dog1.jpg` | [0.90, 0.80, 0.95] | dog | + +Now suppose the user has a new image of a cat (`cat3.jpg`) and uses an external encoder/model to generate its embedding: + +- Query embedding for `cat3.jpg`: `q = [0.09, 0.19, 0.31]` + +A typical query flow for this example would be: + +1. User generates the query embedding `q` for a new image using an encoder/model. +2. User calls a vector search function on the table and passes `q`. +3. The function reads all rows (or a filtered subset) from the Hudi table. +4. For each row, it extracts `image_embedding` and computes the distance to `q` (e.g., L2 Euclidean distance): + + - `dist(q, cat1.jpg) ≈ 0.02` + - `dist(q, cat2.jpg) ≈ 0.03` + - `dist(q, dog1.jpg) ≈ 1.20` + +5. It keeps the *k* rows with the lowest distance (highest similarity). For example, for `k = 2` the results might be: + + | id | image | label | `_distance` | + |----|------------|-------|-------------| + | 1 | `cat1.jpg` | cat | 0.02 | + | 2 | `cat2.jpg` | cat | 0.03 | + + +6. The function returns these rows plus the `_distance` column. +7. The user materializes or joins these top-k similar images for downstream use (e.g., showing “similar images” in an application). + +### User Experience + +Requirements: + +- Users should be able to perform this operation via DataFrames (used heavily in AI/ML) and SQL (used primarily for analytics). +- Expose clear, minimal parameters (table name or path, embedding column in the table, query embedding, which distance algorithm, top-k results). +- Return a table/DataFrame with an extra `_distance` (or `_score`) column that can be leveraged by the user for further data manipulation, such as filtering. 
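The brute-force flow from the Visual Example (steps 3 to 5) can be sketched in plain Scala. This is only an in-memory illustration under stated assumptions, not the proposed `VectorSearchRelation`: `ImageRow`, `BruteForceKnn`, `l2Distance`, and `topK` are hypothetical names introduced here, and no Hudi or Spark API is involved.

```scala
// Hypothetical in-memory stand-in for a table row; not a Hudi or Spark type.
final case class ImageRow(id: Int, image: String, embedding: Array[Float], label: String)

object BruteForceKnn {
  // L2 (Euclidean) distance between two vectors of equal dimension.
  def l2Distance(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, s"dimension mismatch: ${a.length} vs ${b.length}")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i).toDouble - b(i).toDouble
      sum += d * d
      i += 1
    }
    math.sqrt(sum)
  }

  // Score every row against the query, then keep the k rows with the smallest distance.
  def topK(rows: Seq[ImageRow], query: Array[Float], k: Int): Seq[(ImageRow, Double)] =
    rows.map(r => (r, l2Distance(r.embedding, query))).sortBy(_._2).take(k)

  // The three rows and the query embedding from the example table above.
  val table: Seq[ImageRow] = Seq(
    ImageRow(1, "cat1.jpg", Array(0.10f, 0.20f, 0.30f), "cat"),
    ImageRow(2, "cat2.jpg", Array(0.11f, 0.18f, 0.29f), "cat"),
    ImageRow(3, "dog1.jpg", Array(0.90f, 0.80f, 0.95f), "dog")
  )
  val query: Array[Float] = Array(0.09f, 0.19f, 0.31f)

  def main(args: Array[String]): Unit =
    topK(table, query, k = 2).foreach { case (row, dist) =>
      println(f"${row.id}%d ${row.image}%s ${row.label}%s _distance=$dist%.2f")
    }
}
```

With `k = 2` this keeps rows 1 and 2, matching the result table in step 5. A full scan like this costs one distance computation per row, which is the cost a future vector index would aim to reduce.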
+ +### Proposed Interface + +We propose a Spark relation that can be used for both Spark DataFrames and Spark SQL. + +```text +┌─────────────────────────────────────────────────────────────┐ +│ User-Facing APIs │ +├──────────────────────────┬──────────────────────────────────┤ +│ DataFrame Extension │ SQL Table-Valued Function │ +│ │ │ +│ df.vector_search(...) │ SELECT * FROM │ +│ │ hudi_vector_search(...) │ +└──────────────┬───────────┴────────────┬─────────────────────┘ + │ │ + ▼ ▼ + ┌────────────────────────────────────────┐ + │ VectorSearchRelation (Scala) │ + │ - Core k-NN search logic │ + │ - Filter pushdown │ + │ - Distance calculations │ + │ - File format integration │ + └────────────────┬───────────────────────┘ + │ + ▼ + ┌────────────────────┐ + │ Hudi Table │ + │ (with vectors) │ + └────────────────────┘ + + + +``` + +### Query Example + +We want to follow semantics similar to those that other modern data systems offer for vector search. +See the following reference material: +* https://docs.databricks.com/aws/en/sql/language-manual/functions/vector_search +* https://docs.snowflake.com/en/user-guide/snowflake-cortex/vector-embeddings + +An initial sketch of the Hudi vector search interface could look like this: +``` +SELECT * +FROM hudi_vector_search( +table name or table_path => 'table' OR 's3://bucket/path/to/hudi/table', +index_name => 'my_index', -- optional +embedding_col => 'image_embedding', +query_vector => ARRAY(0.12F, -0.03F, 0.81F, ...), +k => 10, +distance_metric => 'cosine' +); +``` +Users can then chain other SQL operations on top of this, such as filters and joins on the results. 
+ +``` + // Vector search with WHERE clause filtering + val result = spark.sql( + s""" + |SELECT id, name, price, category, _distance + |FROM hudi_vector_search( + | '$tablePath', + | 'embedding', + | ARRAY(1.0, 2.0, 3.0), + | 10 + |) + |WHERE category = 'electronics' AND price < 100 + |ORDER BY _distance + |""".stripMargin + ).collect() + +``` + +``` + // Vector search with JOIN + val result = spark.sql( + s""" + |SELECT vs.id, vs.name, c.category_name, vs._distance + |FROM hudi_vector_search( + | '$productsPath', + | 'embedding', + | ARRAY(1.5, 2.5), + | 3 + |) vs + |JOIN $categoriesTable c ON vs.category_id = c.category_id + |ORDER BY vs._distance + |""".stripMargin + ).collect() + +``` + +For an initial POC of this RFC please see the following commit in a personal branch: https://github.com/rahil-c/hudi/commit/8a7e1343a347e8061e0e566563c847cfd4ea2f2c#diff-16b383ee6fd99ebc491f3e15185c585e4a7c0796fc3461b8439d28cfee046468 + +## Future Enhancements + +The next goal would be to integrate vector search into Hudi's indexing and metadata capabilities, similar to how Hudi already uses the metadata table and secondary indexes (e.g., Bloom, column stats) to accelerate queries. +This RFC intentionally defers detailed index design to a dedicated vector-index RFC, but discusses some ideas at a high level. + +#### Vector Index Algorithms
Review Comment: Please clearly mark these as Appendix
