RKSPD opened a new pull request, #14892:
URL: https://github.com/apache/lucene/pull/14892
## Motivation
Lucene’s built‑in HNSW KnnVectorsFormat delivers strong recall/latency, but
its index must reside entirely in RAM. As demand for vector datasets of higher
dimensionality and larger index size increases, the cost of scaling systems
like HNSW becomes prohibitive.
JVector is a pure‑Java ANN engine that ultimately aims to merge DiskANN’s
disk‑resident search with HNSW’s navigable‑small‑world graph. Today the library
still loads the whole graph in RAM (like plain HNSW), but its public roadmap is
moving toward split‑layer storage where only the upper graph levels live in
memory and deeper layers + raw vectors remain on disk.
OpenSearch has successfully integrated JVector through the
OpenSearch-JVector repository, but the current implementation contains several
OpenSearch-specific dependencies. As OpenSearch continues to develop new
features and optimizations for its codec, this implementation allows those
features to be developed and tested in Lucene itself. To that end, this PR
also links to a luceneutil-jvector repository that works with the proposed
JVector codec without significant modifications.
## Dependency Information
* **`io.github.jbellis:jvector:4.0.0-beta.6`** – the ANN engine (automatic
module `jvector`)
* **`org.agrona:agrona:1.20.0`** – off-heap buffer utilities
* **`org.apache.commons:commons-math3:3.6.1`** – PQ math helpers
* **`org.yaml:snakeyaml:2.4`** – only needed if you load YAML tuning files
* **`org.slf4j:slf4j-api:2.0.17`** – logging façade (overrides JVector’s
2.0.16 to match the rest of Lucene)
* *All jars have matching LICENSE/NOTICE entries added under
`lucene/licenses/`*
## Vector Codec – design highlights
*Per-segment, per-field indexes*
Each Lucene segment owns its own JVector graph index. The graph payloads
live in a single `*.data-jvector` file and the per-field metadata lives in a
companion `*.meta-jvector` file, mirroring Lucene’s existing `*.vec`/`*.vex` layout.
*Bulk build at flush time*
Vectors are streamed into the ordinary flat-vector writer while an in-memory
OnHeapGraphIndex is built.
When the segment flushes, the whole graph (and optional Product Quantization
code-books) is handed to OnDiskSequentialGraphIndexWriter and serialized to
disk in one pass.
*Single data file, concatenated fields*
All field-specific graphs (and PQ blobs) are appended one after another
inside `*.data-jvector`; their start offsets, lengths, and build parameters are
recorded in `*.meta-jvector` so the reader can jump straight to the right slice.
*Zero-copy loading on open*
JVectorReader memory-maps the data file and spawns a lightweight
OnDiskGraphIndex for each field via ReaderSupplier. No temp files are created;
the mmap’d bytes are shared across threads and searches.
*Pure-Java search path*
At query time the float vector is passed directly to GraphSearcher
(DiskANN-style). Results are optionally re-ranked with an exact scorer, then
surfaced through a thin JVectorKnnCollector wrapper so the rest of Lucene sees
a normal TopDocs.
*Ordinal → doc-ID mapping still in Lucene*
JVector returns internal ordinals; we convert them to docIDs using Lucene’s
existing ordinal map during collection.
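For orientation, here is a minimal sketch of how such a format plugs into ordinary Lucene indexing and search. The class name `JVectorFormat` and its no-argument constructor are assumptions for illustration only (the actual class and constructor in this PR may differ), and `Lucene101Codec` stands in for whichever default codec the target branch ships:

```java
import java.nio.file.Paths;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class JVectorFormatDemo {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/jvector-demo"))) {
      IndexWriterConfig iwc = new IndexWriterConfig();
      // Route every vector field to the JVector-backed format; all other
      // formats stay at the delegate codec's defaults.
      iwc.setCodec(new Lucene101Codec() {
        @Override
        public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
          return new JVectorFormat(); // assumed class name from this PR
        }
      });

      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField(
            "vector", new float[] {0.1f, 0.2f, 0.3f}, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
        writer.commit();
      }

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // The JVector graph search runs inside the per-field vectors reader;
        // callers only see an ordinary TopDocs.
        TopDocs hits = searcher.search(
            new KnnFloatVectorQuery("vector", new float[] {0.1f, 0.2f, 0.3f}, 10), 10);
        System.out.println(hits.totalHits);
      }
    }
  }
}
```

The important point is that nothing changes for callers: the per-field format override routes vector fields to JVector, and `KnnFloatVectorQuery` results come back as a normal TopDocs.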
## Initial Benchmark Results
### Small Corpus Testing (Wikipedia Cohere 768, 200k docs)
```
Results: Lucene
recall  latency(ms)  netCPU  avgCpuCount  nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
0.803  2.800  2.343  0.837  200000  100  300  12  16  7 bits  8.46  23646.25  7.58  1  736.49  1.000  733.185  147.247  HNSW
0.822  2.486  2.286  0.920  200000  100  300  12  20  7 bits  7.33  27273.97  7.45  1  736.76  1.000  733.185  147.247  HNSW
0.857  2.657  2.429  0.914  200000  100  300  12  28  7 bits  13.64  14658.46  8.97  1  737.15  1.000  733.185  147.247  HNSW
0.831  2.771  2.514  0.907  200000  100  300  16  16  7 bits  6.44  31075.20  7.42  1  736.61  1.000  733.185  147.247  HNSW
0.846  2.857  2.571  0.900  200000  100  300  16  20  7 bits  7.19  27812.54  8.42  1  736.86  1.000  733.185  147.247  HNSW
0.869  3.029  2.657  0.877  200000  100  300  16  28  7 bits  8.47  23626.70  10.04  1  737.17  1.000  733.185  147.247  HNSW
0.847  2.829  2.486  0.879  200000  100  300  20  16  7 bits  6.11  32717.16  7.05  1  736.68  1.000  733.185  147.247  HNSW
0.862  2.743  2.429  0.885  200000  100  300  20  20  7 bits  6.92  28893.38  8.13  1  736.88  1.000  733.185  147.247  HNSW
0.883  3.086  2.743  0.889  200000  100  300  20  28  7 bits  7.94  25176.23  8.90  1  737.26  1.000  733.185  147.247  HNSW
0.860  2.943  2.657  0.903  200000  100  300  24  16  7 bits  9.37  21342.44  7.21  1  736.69  1.000  733.185  147.247  HNSW
0.880  3.371  3.143  0.932  200000  100  300  24  20  7 bits  7.77  25749.97  8.38  1  736.92  1.000  733.185  147.247  HNSW
0.900  3.086  2.886  0.935  200000  100  300  24  28  7 bits  8.37  23900.57  9.70  1  737.29  1.000  733.185  147.247  HNSW
```
```
Results: JVector
recall  latency(ms)  netCPU  avgCpuCount  nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.877  3.943  3.714  0.942  200000  100  300  12  16  7 bits  12.94  15458.34  101.28  1  1197.28  733.185  147.247  HNSW
0.901  3.771  3.629  0.962  200000  100  300  12  20  7 bits  13.89  14394.70  123.37  1  1197.28  733.185  147.247  HNSW
0.913  3.457  3.314  0.959  200000  100  300  12  28  7 bits  18.52  10802.05  136.17  1  1197.28  733.185  147.247  HNSW
0.915  3.743  3.571  0.954  200000  100  300  16  16  7 bits  15.16  13193.48  118.83  1  1200.28  733.185  147.247  HNSW
0.921  4.029  3.857  0.957  200000  100  300  16  20  7 bits  18.83  10620.22  134.91  1  1200.28  733.185  147.247  HNSW
0.931  3.886  3.714  0.956  200000  100  300  16  28  7 bits  22.87  8746.61  174.35  1  1200.28  733.185  147.247  HNSW
0.921  5.400  5.257  0.974  200000  100  300  20  16  7 bits  15.68  12758.36  126.82  1  1203.30  733.185  147.247  HNSW
0.929  4.229  4.057  0.959  200000  100  300  20  20  7 bits  19.68  10161.57  152.86  1  1203.30  733.185  147.247  HNSW
0.942  4.343  4.171  0.961  200000  100  300  20  28  7 bits  27.79  7197.35  212.50  1  1203.30  733.185  147.247  HNSW
0.930  4.257  4.086  0.960  200000  100  300  24  16  7 bits  17.47  11449.51  131.11  1  1206.33  733.185  147.247  HNSW
0.943  4.314  4.143  0.960  200000  100  300  24  20  7 bits  21.34  9371.63  162.54  1  1206.33  733.185  147.247  HNSW
0.940  4.914  4.743  0.965  200000  100  300  24  28  7 bits  29.75  6722.24  235.78  1  1206.33  733.185  147.247  HNSW
```
# Testing JVectorCodec Using luceneutil-jvector
This guide provides step-by-step instructions for benchmarking and testing
JVectorCodec performance using the luceneutil-jvector testing framework.
## Prerequisites
* Java development environment with Gradle support
* Python 3.x installed
* Git installed
* SSD storage recommended for optimal performance
## Setup Instructions
### 1. Environment Preparation
Create a benchmark directory (referred to below as `LUCENE_BENCH_HOME`) on an SSD for optimal I/O performance:
```
mkdir LUCENE_BENCH_HOME
cd LUCENE_BENCH_HOME
```
### 2. Repository Cloning
Clone the required repositories:
```
git clone https://github.com/RKSPD/lucene-jvector lucene_candidate
git clone https://github.com/RKSPD/luceneutil-jvector util
```
**Note:** The `lucene-jvector` repository contains the same code as the PR
under review.
### 3. Initial Setup and Data Download
Navigate to the utilities directory and run the initial setup:
```
cd util
python3 src/python/initial_setup.py -d
```
This command will download the necessary test datasets. The download process
may take some time depending on your internet connection.
### 4. Lucene Build
While the data is downloading, open a new terminal session and build Lucene:
```
cd LUCENE_BENCH_HOME/lucene_candidate
./gradlew build
```
## Running Performance Tests
### 5. Initial Test Run
Once both the build and download processes are complete, navigate back to
the utilities directory:
```
cd LUCENE_BENCH_HOME/util
```
Run the KNN performance test:
```
./gradlew runKnnPerfTest
```
**Important:** The first execution is expected to fail. This initial run
generates the path definitions for your Lucene repository and determines the
Lucene version.
### 6. Successful Test Execution
Run the performance test a second time:
```
./gradlew runKnnPerfTest
```
This execution should complete successfully and provide performance metrics.
## Configuration and Tuning
### 7. Parameter Customization
To customize the testing parameters for your specific benchmarking needs:
#### Merge Policy Configuration
* **File:** `util/src/main/knn/KnnIndexer.java`
* **Purpose:** Configure the merge policy for index optimization
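For example, a minimal sketch of the kind of edit this usually means (the helper name and the specific numbers below are hypothetical, not taken from the PR or from luceneutil):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;

final class MergePolicyTuning {
  // Hypothetical helper mirroring the kind of change you would make inside
  // KnnIndexer: choose the merge policy used while building the vector index.
  static IndexWriterConfig withBenchmarkMergePolicy(IndexWriterConfig iwc, boolean disableMerges) {
    if (disableMerges) {
      // Measure pure flush/indexing throughput and merge later with an
      // explicit forceMerge(1), reported as force_merge(s) in the tables above.
      iwc.setMergePolicy(NoMergePolicy.INSTANCE);
    } else {
      TieredMergePolicy mp = new TieredMergePolicy();
      mp.setSegmentsPerTier(10);          // example values only
      mp.setMaxMergedSegmentMB(5 * 1024); // cap merged segments at ~5 GB
      iwc.setMergePolicy(mp);
    }
    return iwc;
  }
}
```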
#### Codec Configuration
* **File:** `util/src/main/knn/KnnGraphTester.java`
* **Method:** `getCodec()`
* **Purpose:** Specify which codec implementation to test
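A minimal sketch of what such a method could return, reusing the same per-field override as in the indexing example earlier; `JVectorFormat` is an assumed class name, `Lucene101Codec` stands in for the branch's current default codec, and the real `getCodec()` in luceneutil takes its own arguments:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;

final class JVectorCodecSelector {
  // Sketch only: pick between the default codec (Lucene's HNSW format) and a
  // codec whose per-field KnnVectorsFormat is the JVector-backed one.
  static Codec getCodec(boolean useJVector) {
    if (useJVector == false) {
      return Codec.getDefault(); // baseline measured in the tables above
    }
    return new Lucene101Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return new JVectorFormat(); // assumed class name from this PR
      }
    };
  }
}
```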
#### Performance Test Parameters
* **File:** `util/src/python/knnPerfTest.py`
* **Section:** `params` block
* **Purpose:** Adjust various performance testing parameters including:
* Vector dimensions
* Index size
* Query parameters
* Recall targets
* Other algorithm-specific settings
## Expected Outcomes
Upon successful completion, you will have:
* A fully configured benchmarking environment
* Performance metrics comparing JVectorCodec against baseline implementations
* Configurable parameters for comprehensive testing scenarios
## Troubleshooting
* Ensure sufficient disk space for dataset downloads and index generation
* Verify Java and Python environments are properly configured
* Check network connectivity if initial setup fails during download phase
* Confirm SSD usage for optimal I/O performance during benchmarking
## Long-Term Considerations
**Split-layer storage roadmap**
* JVector aims to keep only the upper graph levels in RAM while deeper layers
and raw vectors live on disk. Plan for API changes and configuration knobs as
this feature stabilizes.
**Backwards compatibility with previous JVector implementations**
* As the codec evolves, there is no guarantee that indexes generated by
earlier JVectorCodec implementations will work with newer versions of JVector.