RKSPD opened a new pull request, #14892:
URL: https://github.com/apache/lucene/pull/14892
## Motivation
Lucene’s built‑in HNSW KnnVectorsFormat delivers strong recall/latency, but
its index must reside entirely in RAM. As demand for vector datasets of higher
dimensionality and larger index size increases, the cost of scaling systems
like HNSW becomes prohibitive.
JVector is a pure‑Java ANN engine that ultimately aims to merge DiskANN’s
disk‑resident search with HNSW’s navigable‑small‑world graph. Today the library
still loads the whole graph in RAM (like plain HNSW), but its public roadmap is
moving toward split‑layer storage where only the upper graph levels live in
memory and deeper layers + raw vectors remain on disk.
OpenSearch has successfully integrated JVector through the
OpenSearch-JVector repository, but the current implementation contains several
OpenSearch-specific dependencies. As OpenSearch continues to develop new
features and optimizations for its codec, this implementation allows those
features to be developed and tested in Lucene itself. To that end, this PR
also links to a luceneutil-jvector repository that works with the proposed
JVector codec without significant modifications.
## Dependency Information
* **`io.github.jbellis:jvector:4.0.0-beta.6`** – the ANN engine (automatic
module `jvector`)
* **`org.agrona:agrona:1.20.0`** – off-heap buffer utilities
* **`org.apache.commons:commons-math3:3.6.1`** – PQ math helpers
* **`org.yaml:snakeyaml:2.4`** – only needed if you load YAML tuning files
* **`org.slf4j:slf4j-api:2.0.17`** – logging façade (overrides JVector’s
2.0.16 to match the rest of Lucene)
* *All jars have matching LICENSE/NOTICE entries added under
`lucene/licenses/`*
## Vector Codec – design highlights
*Per-segment, per-field indexes*
Each Lucene segment owns its own JVector graph index. The graph payloads
live in a single `*.data-jvector` file and the per-field metadata lives in a
companion `*.meta-jvector` file, mirroring Lucene’s existing `*.vec`/`*.vex` layout.
*Bulk build at flush time*
Vectors are streamed into the ordinary flat-vector writer while an in-memory
OnHeapGraphIndex is built.
When the segment flushes, the whole graph (and optional Product Quantization
code-books) is handed to OnDiskSequentialGraphIndexWriter and serialized to
disk in one pass.
*Single data file, concatenated fields*
All field-specific graphs (and PQ blobs) are appended one after another
inside `*.data-jvector`; their start offsets, lengths, and build parameters are
recorded in `*.meta-jvector` so the reader can jump straight to the right slice.
*Zero-copy loading on open*
JVectorReader memory-maps the data file and spawns a lightweight
OnDiskGraphIndex for each field via ReaderSupplier. No temp files are created;
the mmap’d bytes are shared across threads and searches.
*Pure-Java search path*
At query time the float vector is passed directly to GraphSearcher
(DiskANN-style). Results are optionally re-ranked with an exact scorer, then
surfaced through a thin JVectorKnnCollector wrapper so the rest of Lucene sees
a normal TopDocs.
*Ordinal → doc-ID mapping still in Lucene*
JVector returns internal ordinals; we convert them to docIDs using Lucene’s
existing ordinal map during collection.
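For orientation, here is a minimal sketch of how such a format plugs into ordinary Lucene indexing and search. The class name `JVectorFormat` and its no-argument constructor are assumptions for illustration only (the actual class and constructor in this PR may differ), and `Lucene101Codec` stands in for whichever default codec the target branch ships:

```java
import java.nio.file.Paths;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class JVectorFormatDemo {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/jvector-demo"))) {
      IndexWriterConfig iwc = new IndexWriterConfig();
      // Route every vector field to the JVector-backed format; all other
      // formats stay at the delegate codec's defaults.
      iwc.setCodec(new Lucene101Codec() {
        @Override
        public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
          return new JVectorFormat(); // assumed class name from this PR
        }
      });

      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        Document doc = new Document();
        doc.add(new KnnFloatVectorField(
            "vector", new float[] {0.1f, 0.2f, 0.3f}, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
        writer.commit();
      }

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // The JVector graph search runs inside the per-field vectors reader;
        // callers only see an ordinary TopDocs.
        TopDocs hits = searcher.search(
            new KnnFloatVectorQuery("vector", new float[] {0.1f, 0.2f, 0.3f}, 10), 10);
        System.out.println(hits.totalHits);
      }
    }
  }
}
```

The important point is that nothing changes for callers: the per-field format override routes vector fields to JVector, and `KnnFloatVectorQuery` results come back as a normal TopDocs.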
## Initial Benchmark Results
### Small Corpus Testing (Wikipedia Cohere 768, 200k docs)
```
Results: Lucene
recall  latency(ms)  netCPU  avgCpuCount  nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
0.803  2.800  2.343  0.837  200000  100  300  12  16  7 bits  8.46  23646.25  7.58  1  736.49  1.000  733.185  147.247  HNSW
0.822  2.486  2.286  0.920  200000  100  300  12  20  7 bits  7.33  27273.97  7.45  1  736.76  1.000  733.185  147.247  HNSW
0.857  2.657  2.429  0.914  200000  100  300  12  28  7 bits  13.64  14658.46  8.97  1  737.15  1.000  733.185  147.247  HNSW
0.831  2.771  2.514  0.907  200000  100  300  16  16  7 bits  6.44  31075.20  7.42  1  736.61  1.000  733.185  147.247  HNSW
0.846  2.857  2.571  0.900  200000  100  300  16  20  7 bits  7.19  27812.54  8.42  1  736.86  1.000  733.185  147.247  HNSW
0.869  3.029  2.657  0.877  200000  100  300  16  28  7 bits  8.47  23626.70  10.04  1  737.17  1.000  733.185  147.247  HNSW
0.847  2.829  2.486  0.879  200000  100  300  20  16  7 bits  6.11  32717.16  7.05  1  736.68  1.000  733.185  147.247  HNSW
0.862  2.743  2.429  0.885  200000  100  300  20  20  7 bits  6.92  28893.38  8.13  1  736.88  1.000  733.185  147.247  HNSW
0.883  3.086  2.743  0.889  200000  100  300  20  28  7 bits  7.94  25176.23  8.90  1  737.26  1.000  733.185  147.247  HNSW
0.860  2.943  2.657  0.903  200000  100  300  24  16  7 bits  9.37  21342.44  7.21  1  736.69  1.000  733.185  147.247  HNSW
0.880  3.371  3.143  0.932  200000  100  300  24  20  7 bits  7.77  25749.97  8.38  1  736.92  1.000  733.185  147.247  HNSW
0.900  3.086  2.886  0.935  200000  100  300  24  28  7 bits  8.37  23900.57  9.70  1  737.29  1.000  733.185  147.247  HNSW
```
```
Results: JVector
recall  latency(ms)  netCPU  avgCpuCount  nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
0.877  3.943  3.714  0.942  200000  100  300  12  16  7 bits  12.94  15458.34  101.28  1  1197.28  733.185  147.247  HNSW
0.901  3.771  3.629  0.962  200000  100  300  12  20  7 bits  13.89  14394.70  123.37  1  1197.28  733.185  147.247  HNSW
0.913  3.457  3.314  0.959  200000  100  300  12  28  7 bits  18.52  10802.05  136.17  1  1197.28  733.185  147.247  HNSW
0.915  3.743  3.571  0.954  200000  100  300  16  16  7 bits  15.16  13193.48  118.83  1  1200.28  733.185  147.247  HNSW
0.921  4.029  3.857  0.957  200000  100  300  16  20  7 bits  18.83  10620.22  134.91  1  1200.28  733.185  147.247  HNSW
0.931  3.886  3.714  0.956  200000  100  300  16  28  7 bits  22.87  8746.61  174.35  1  1200.28  733.185  147.247  HNSW
0.921  5.400  5.257  0.974  200000  100  300  20  16  7 bits  15.68  12758.36  126.82  1  1203.30  733.185  147.247  HNSW
0.929  4.229  4.057  0.959  200000  100  300  20  20  7 bits  19.68  10161.57  152.86  1  1203.30  733.185  147.247  HNSW
0.942  4.343  4.171  0.961  200000  100  300  20  28  7 bits  27.79  7197.35  212.50  1  1203.30  733.185  147.247  HNSW
0.930  4.257  4.086  0.960  200000  100  300  24  16  7 bits  17.47  11449.51  131.11  1  1206.33  733.185  147.247  HNSW
0.943  4.314  4.143  0.960  200000  100  300  24  20  7 bits  21.34  9371.63  162.54  1  1206.33  733.185  147.247  HNSW
0.940  4.914  4.743  0.965  200000  100  300  24  28  7 bits  29.75  6722.24  235.78  1  1206.33  733.185  147.247  HNSW
```
# Testing JVectorCodec Using luceneutil-jvector
This guide provides step-by-step instructions for benchmarking and testing
JVectorCodec performance using the luceneutil-jvector testing framework.
## Prerequisites
* Java development environment with Gradle support
* Python 3.x installed
* Git installed
* SSD storage recommended for optimal performance
## Setup Instructions
### 1. Environment Preparation
Create a benchmark directory (referred to below as `LUCENE_BENCH_HOME`) on an SSD for optimal I/O performance:
```
mkdir LUCENE_BENCH_HOME
cd LUCENE_BENCH_HOME
```
### 2. Repository Cloning
Clone the required repositories:
```
git clone https://github.com/RKSPD/lucene-jvector lucene_candidate
git clone https://github.com/RKSPD/luceneutil-jvector util
```
**Note:** The `lucene-jvector` repository contains the same code as the PR
under review.
### 3. Initial Setup and Data Download
Navigate to the utilities directory and run the initial setup:
```
cd util
python3 src/python/initial_setup.py -d
```
This command will download the necessary test datasets. The download process
may take some time depending on your internet connection.
### 4. Lucene Build
While the data is downloading, open a new terminal session and build Lucene:
```
cd LUCENE_BENCH_HOME/lucene_candidate
./gradlew build
```
## Running Performance Tests
### 5. Initial Test Run
Once both the build and download processes are complete, navigate back to
the utilities directory:
```
cd LUCENE_BENCH_HOME/util
```
Run the KNN performance test:
```
./gradlew runKnnPerfTest
```
**Important:** The first execution is expected to fail. This initial run
generates the path definitions for your Lucene repository and determines the
Lucene version.
### 6. Successful Test Execution
Run the performance test a second time:
```
./gradlew runKnnPerfTest
```
This execution should complete successfully and provide performance metrics.
## Configuration and Tuning
### 7. Parameter Customization
To customize the testing parameters for your specific benchmarking needs:
#### Merge Policy Configuration
* **File:** `util/src/main/knn/KnnIndexer.java`
* **Purpose:** Configure the merge policy for index optimization
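For example, a minimal sketch of the kind of edit this usually means (the helper name and the specific numbers below are hypothetical, not taken from the PR or from luceneutil):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;

final class MergePolicyTuning {
  // Hypothetical helper mirroring the kind of change you would make inside
  // KnnIndexer: choose the merge policy used while building the vector index.
  static IndexWriterConfig withBenchmarkMergePolicy(IndexWriterConfig iwc, boolean disableMerges) {
    if (disableMerges) {
      // Measure pure flush/indexing throughput and merge later with an
      // explicit forceMerge(1), reported as force_merge(s) in the tables above.
      iwc.setMergePolicy(NoMergePolicy.INSTANCE);
    } else {
      TieredMergePolicy mp = new TieredMergePolicy();
      mp.setSegmentsPerTier(10);          // example values only
      mp.setMaxMergedSegmentMB(5 * 1024); // cap merged segments at ~5 GB
      iwc.setMergePolicy(mp);
    }
    return iwc;
  }
}
```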
#### Codec Configuration
* **File:** `util/src/main/knn/KnnGraphTester.java`
* **Method:** `getCodec()`
* **Purpose:** Specify which codec implementation to test
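A minimal sketch of what such a method could return, reusing the same per-field override as in the indexing example earlier; `JVectorFormat` is an assumed class name, `Lucene101Codec` stands in for the branch's current default codec, and the real `getCodec()` in luceneutil takes its own arguments:

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;

final class JVectorCodecSelector {
  // Sketch only: pick between the default codec (Lucene's HNSW format) and a
  // codec whose per-field KnnVectorsFormat is the JVector-backed one.
  static Codec getCodec(boolean useJVector) {
    if (useJVector == false) {
      return Codec.getDefault(); // baseline measured in the tables above
    }
    return new Lucene101Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return new JVectorFormat(); // assumed class name from this PR
      }
    };
  }
}
```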
#### Performance Test Parameters
* **File:** `util/src/python/knnPerfTest.py`
* **Section:** `params` block
* **Purpose:** Adjust various performance testing parameters including:
* Vector dimensions
* Index size
* Query parameters
* Recall targets
* Other algorithm-specific settings
## Expected Outcomes
Upon successful completion, you will have:
* A fully configured benchmarking environment
* Performance metrics comparing JVectorCodec against baseline implementations
* Configurable parameters for comprehensive testing scenarios
## Troubleshooting
* Ensure sufficient disk space for dataset downloads and index generation
* Verify Java and Python environments are properly configured
* Check network connectivity if initial setup fails during download phase
* Confirm SSD usage for optimal I/O performance during benchmarking
## Long-Term Considerations
**Split-layer storage roadmap**
* JVector aims to keep only the upper graph levels in RAM while deeper layers
and raw vectors live on disk. Plan for API changes and configuration knobs as
this feature stabilizes.
**Backwards compatibility with previous JVector implementations**
* As the codec evolves, there is no guarantee that indexes generated by
earlier JVectorCodec implementations will work with newer versions of JVector.