GitHub user Beihao-Zhou edited a discussion: Vector Search HNSW Indexing Encoding
# Vector Search HNSW Indexing Encoding ## Goal We aim to integrate the Hierarchical Navigable Small World (HNSW) algorithm into Kvrocks to provide an efficient, scalable solution for approximate nearest neighbour searches in high-dimensional spaces. ## Background  HNSW builds a multi-layered structure where each layer acts as a separate graph. The top layers contain fewer nodes and are used to facilitate fast traversal over large distances within the dataset. Lower layers have more nodes, enabling precise navigation and search in the neighborhood of the query point. In a word, HNSW is a skiplist of graph. The bottom-most level 0 NSW layer contains all information, and we randomly put some vectors to the upper layer (more upper layer has fewer elements), which are also NSW indexes. The search process starts from the upper-most layer, and uses neighbors in that layer as the entry points of the lower layer. *Reference:* [Write a vector Database](https://skyzh.github.io/write-you-a-vector-db/cpp-06-02-hnsw.html) ## HNSW metadata - Key: `index_name` - Value - **Basic Metadata** - `flags` (1 byte) - `expiration` (Ebytes) - `version` (8 bytes) - `size` (Sbytes) - **HNSW Specific Metadata** - `storing_data_type` (1 byte): Vectors could be stored in both JSON and hashes. (0 = hashes, 1 = JSON). - `vector_field` (Sbyte) - `entryPoint` (Sbytes): Entry node ID for search initiation, i.e. key for returned data??. - `type` (1 byte): Enum property to indicate vector type (0 = FLOAT32, 1 = FLOAT64). - `dim` (2 bytes): Stores the vector dimension as a 16-bit integer. - `distanceMetric` (1 byte): Enum for distance metric (0 = L2, 1 = IP, 2 = COSINE). - `initialCap` (4 bytes): Initial capacity as a 32-bit integer. - `M` (2 bytes): Maximum number of outgoing edges per node as a 16-bit integer. - `efConstruction` (4 bytes): Size of the dynamic candidate list during construction as a 32-bit integer. - `efRuntime` (4 bytes): Size of the candidate list during search as a 32-bit integer. - `epsilon` (4 bytes): Floating point to extend search radius, stored as a 32-bit float. - `maxElements` (4 bytes): Maximum elements stored as a 32-bit integer. - `currentLevel` (2 bytes): Highest level nodes currently reach as a 16-bit integer. ## HNSW Graph sub key-values ### Nodes - Key: `index_name | level | node_id` - `level` is the current level of the node - `node_id` is the key for the indexed data point - Value: `num_neighbours` ### Edges - Key: `index_name | level | node_id | connected_node_id` - Value: `computed_distance` ## Inverted Index Key-values After you create an index, Redis Stack automatically indexes any existing, modified, or newly created JSON documents stored in the database. When inserting a new entry, it should be aware that it’s part of the HNSW indexing. - Key: `type | prefix | vector_field` - Value: `index_name` ## APIs All interfaces and internal APIs will be implemented based on: - [HNSW Algorithm Paper](https://arxiv.org/pdf/1603.09320) - [Write your own database HNSW solution](https://github.com/skyzh/bustub-vectordb/blob/497b6858903306d9704ac3cc9b06abd83e77203e/src/storage/index/hnsw_index.cpp) ## Appendix - Redis Vector: [https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/](https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/) - Kvrocks data structure encoding: [https://kvrocks.apache.org/community/data-structure-on-rocksdb/](https://kvrocks.apache.org/community/data-structure-on-rocksdb/) - Other Sample Impl: https://github.com/swapneel/hnsw-rust GitHub link: https://github.com/apache/kvrocks/discussions/2316 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
