GitHub user Beihao-Zhou edited a discussion: Vector Search HNSW Indexing 
Encoding

# Vector Search HNSW Indexing Encoding

## Goal

We aim to integrate the Hierarchical Navigable Small World (HNSW) algorithm 
into Kvrocks to provide an efficient, scalable solution for approximate nearest 
neighbour searches in high-dimensional spaces. 

## Background
![06-hnsw-explore](https://github.com/apache/kvrocks/assets/68083940/93f37a2a-e140-4c3a-9e1c-c63f2b64a39f)

HNSW builds a multi-layered structure where each layer acts as a separate 
graph. The top layers contain fewer nodes and are used to facilitate fast 
traversal over large distances within the dataset. Lower layers have more 
nodes, enabling precise navigation and search in the neighborhood of the query 
point. In a word, HNSW is a skiplist of graph. 

The bottom-most level 0 NSW layer contains all information, and we randomly put 
some vectors to the upper layer (more upper layer has fewer elements), which 
are also NSW indexes. The search process starts from the upper-most layer, and 
uses neighbors in that layer as the entry points of the lower layer.

*Reference:* [Write a vector 
Database](https://skyzh.github.io/write-you-a-vector-db/cpp-06-02-hnsw.html)

## HNSW metadata

- Key: `index_name`
- Value
    - **Basic Metadata**
        - `flags` (1 byte)
        - `expiration` (Ebytes)
        - `version` (8 bytes)
        - `size` (Sbytes)
    - **HNSW Specific Metadata**
        - `storing_data_type` (1 byte): Vectors could be stored in both JSON 
and hashes. (0 = hashes, 1 = JSON).
        - `vector_field` (Sbyte)
        - `entryPoint` (Sbytes): Entry node ID for search initiation, i.e. key 
for returned data??.
        - `type` (1 byte): Enum property to indicate vector type (0 = FLOAT32, 
1 = FLOAT64).
        - `dim` (2 bytes): Stores the vector dimension as a 16-bit integer.
        - `distanceMetric` (1 byte): Enum for distance metric (0 = L2, 1 = IP, 
2 = COSINE).
        - `initialCap` (4 bytes): Initial capacity as a 32-bit integer.
        - `M` (2 bytes): Maximum number of outgoing edges per node as a 16-bit 
integer.
        - `efConstruction` (4 bytes): Size of the dynamic candidate list during 
construction as a 32-bit integer.
        - `efRuntime` (4 bytes): Size of the candidate list during search as a 
32-bit integer.
        - `epsilon` (4 bytes): Floating point to extend search radius, stored 
as a 32-bit float.
        - `maxElements` (4 bytes): Maximum elements stored as a 32-bit integer.
        - `currentLevel` (2 bytes): Highest level nodes currently reach as a 
16-bit integer.

## HNSW Graph sub key-values

### Nodes

- Key: `index_name | level | node_id`
    - `level` is the current level of the node
    - `node_id` is the key for the indexed data point
- Value: `num_neighbours`

### Edges

- Key: `index_name | level | node_id | connected_node_id`
- Value: `computed_distance`

## Inverted Index Key-values

After you create an index, Redis Stack automatically indexes any existing, 
modified, or newly created JSON documents stored in the database. When 
inserting a new entry, it should be aware that it’s part of the HNSW indexing. 

- Key: `type | prefix | vector_field`
- Value: `index_name`

## APIs

All interfaces and internal APIs will be implemented based on:

- [HNSW Algorithm Paper](https://arxiv.org/pdf/1603.09320)
- [Write your own database HNSW 
solution](https://github.com/skyzh/bustub-vectordb/blob/497b6858903306d9704ac3cc9b06abd83e77203e/src/storage/index/hnsw_index.cpp)

## Appendix

- Redis Vector: 
[https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/](https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/)
- Kvrocks data structure encoding: 
[https://kvrocks.apache.org/community/data-structure-on-rocksdb/](https://kvrocks.apache.org/community/data-structure-on-rocksdb/)
- Other Sample Impl: https://github.com/swapneel/hnsw-rust

GitHub link: https://github.com/apache/kvrocks/discussions/2316

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to