Re: [PR] Add TurboQuant rotation-based vector quantization codec to sandbox [lucene]

via GitHub Mon, 06 Apr 2026 15:53:45 -0700


shbhar commented on PR #15903:
URL: https://github.com/apache/lucene/pull/15903#issuecomment-4195152417


   Hi all,  
   
   I also work at Amazon but in Advertising (a different org from @xande's 
product search) where we also use Lucene heavily and I've been independently 
iterating on a TurboQuant implementation for the past week (also with Kiro CLI) 
with various tests and benchmarks but with a different approach from this PR 
where I focus more on 1 bit TQ & never store fp32 vectors to get the full 
compression benefits. I have early comparison benchmark data below and more 
benchmarks are still running (after some bug fixes) and I'll update as more 
data comes in. Branch: https://github.com/shbhar/lucene/tree/turboquant-v1
   
   Below is a summary of the approach and current results co-authored with 
claude 4.6
   
   ## Design philosophy: quantized-only storage
   
   The implementation stores only quantized data on disk — no float32 vectors 
alongside. The key insight is that TQ's quantization quality after FWHT 
rotation is so good especially at higher dimensions that float32 rescoring is 
unnecessary for most use cases. Instead, I rescore directly from quantized data 
using centroid lookup tables.
   
   For users who do need higher-fidelity rescore, this doesn't require baking 
float32 into the codec — they can store vectors in a separate field and use 
Lucene's existing `FloatVectorSimilarityValuesSource` with a rescore query. A 
TurboQuant-specific `ValuesSource` that rescores from a higher-bitwidth TQ 
field (e.g., search at 1-bit, rescore at 8-bit) is also straightforward to add. 
I started with the approach of baking in higher bitwidth TQ vectors for 
rescoring in the codec, but reverted it after realizing that search+rescore 
with quantized only vectors is actually viable and the choice is better left to 
users.
   
   The storage impact at 1M × 4096d (Qwen3-8B embeddings):
   
   | Method | Index Size | Compression |
   |--------|-----------|-------------|
   | Float32 | 15,674 MB | 1× |
   | BBQ-1bit (quantized + float32) | 16,178 MB | 0.97× (larger!) |
   | **TQ-1bit (quantized only)** | **539 MB** | **29×** |
   | **TQ-4bit (quantized only)** | **2,000 MB** | **7.8×** |
   | **TQ-8bit (quantized only)** | **3,951 MB** | **4×** |
   
   This is possible because the centroid LUT rescore operates on the same 
packed bytes as search — `score = queryNorm × docNorm × Σ centroid[bin[i]] × 
rotatedQuery[i]`. No float32 vectors needed.
   
   ## Dimension scaling: "blessing of dimensionality"
   
   100K MS MARCO passages, Qwen3-8B, MRL-truncated to test lower dimensions. 
HNSW (M=32, beamWidth=100, topK=10, fanout=50).
   
   First, raw quantized recall without rescore — this isolates quantization 
quality during HNSW graph traversal:
   
   | Dim | Float32 | SQ-4bit | SQ-8bit | BBQ-1bit | TQ-1bit | TQ-4bit | TQ-8bit 
|
   
|-----|---------|---------|---------|----------|---------|---------|---------|
   | 128 | 0.973 | 0.846 | 0.962 | 0.477 | 0.303 | 0.799 | 0.949 |
   | 512 | 0.971 | 0.822 | 0.903 | 0.632 | 0.587 | **0.878** | **0.956** |
   | 1024 | 0.973 | 0.853 | 0.901 | 0.721 | **0.720** | **0.907** | **0.960** |
   | 4096 | 0.977 | 0.804 | 0.838 | 0.722 | **0.807** | **0.929** | **0.962** |
   
   SQ degrades with dimension (0.846→0.804 at 4-bit, 0.962→0.838 at 8-bit) 
while TQ improves (0.303→0.807 at 1-bit, 0.799→0.929 at 4-bit). 
   
   At ≥1024d, TQ-1bit already matches BBQ-1bit (0.720 vs 0.721), and 
TQ-4bit/8bit surpass their SQ counterparts at ≥512d. At 4096d, TQ-1bit (0.807) 
surpasses SQ-4bit (0.804) without any rescore.
   
   With 5× rescore, the pattern holds and latency tells the full story:
   
   | Dim | SQ-4bit+rsc | | SQ-8bit+rsc | | BBQ-1bit+rsc | | TQ-1bit+rsc | | 
TQ-4bit+rsc | |
   |-----|------|------|------|------|------|------|------|------|------|------|
   | | R@10 | lat | R@10 | lat | R@10 | lat | R@10 | lat | R@10 | lat |
   | 128 | 0.994 | 0.70ms | 0.997 | 0.78ms | 0.845 | 0.71ms | 0.621 | 0.74ms | 
**0.995** | 0.76ms |
   | 512 | 0.994 | 0.93ms | 0.997 | 1.39ms | 0.934 | 0.79ms | 0.902 | 0.96ms | 
**0.998** | 1.28ms |
   | 1024 | 0.996 | 1.29ms | 0.997 | 1.89ms | 0.964 | 0.85ms | **0.962** | 
1.00ms | **0.997** | 1.88ms |
   | 2048 | 0.978 | 1.62ms | 0.989 | 2.87ms | 0.931 | 1.08ms | **0.990** | 
**1.18ms** | **0.988** | 2.67ms |
   | 4096 | 0.986 | 2.77ms | 0.991 | 5.06ms | 0.951 | 1.48ms | **0.997** | 
**1.53ms** | **0.997** | 5.19ms |
   
   At 4096d, TQ-1bit+rsc achieves the highest recall (0.997) at the lowest 
latency (1.53ms) — beating SQ-8bit+rsc (0.991, 5.06ms) on both recall and 
latency, at 30× less storage. 
   
   At ≥1024d, TQ-1bit+rsc matches or exceeds BBQ-1bit+rsc, and at ≥2048d it's 
so strong on every axis (recall, latency, storage) that higher bit widths may 
not even be necessary depending on the dataset. BBQ-1bit+rsc plateaus at 0.951 
because its float32 rescore can't recover from the binary quantization error at 
high dimensions, while TQ-1bit's centroid LUT rescore continues improving.
   
   ## Early Benchmark data: 1M ASIN Vectors, Qwen3-8B, 4096d
   
   1M Amazon product ASINs encoded with Qwen3-Embedding-8B at native 4096 
dimensions. 5K real product search queries. HNSW (M=32, beamWidth=200, topK=10, 
fanout=50, forceMerge to 1 segment).
   
   Note: TQ-4bit and TQ-8bit had a int overflow bug during merge path in this 
run that caused multi-segment indices — latency and forceMerge times for those 
methods were wrong and are omitted. Recall and index size should be mostly 
unaffected. Re-running all methods with the fix and more SQ options - will 
update when done (~12 hours)
   
   | Method | R@10 | Lat (ms) | Docs/s | FMerge (s) | Index MB |
   |--------|------|----------|--------|------------|----------|
   | Float32 | 0.931 | 0.84 | 5,516 | 386 | 15,674 |
   | BBQ-1bit | 0.771 | 0.57 | 12,986 | 418 | 16,178 |
   | BBQ-1bit+5×rsc | 0.977 | 1.49 | 12,847 | 419 | 16,178 |
   | BBQ-1bit+10×rsc | 0.987 | 2.30 | 13,463 | 432 | 16,178 |
   | **TQ-1bit** | **0.741** | **0.48** | **19,924** | **103** | **539** |
   | **TQ-1bit+5×rsc** | **0.968** | **1.56** | **19,966** | **283** | **539** |
   | **TQ-1bit+10×rsc** | **0.985** | **2.43** | **20,054** | **205** | **539** 
|
   | TQ-4bit† | 0.856 | — | 8,386 | — | 2,000 |
   | TQ-4bit+5×rsc† | 0.975 | — | 8,252 | — | 2,000 |
   | TQ-8bit† | 0.946 | — | 6,825 | — | 3,951 |
   | TQ-8bit+5×rsc† | 0.984 | — | 6,680 | — | 3,951 |
   | SQ-4bit+5×rsc | 0.977 | 2.39 | 9,276 | 509 | 17,642 |
   
   † Latency/merge omitted — int overflow caused multi-segment index. Recall is 
largely valid.
   
   TQ-1bit+10×rsc (0.985) matches BBQ-1bit+10×rsc (0.987) at 30× less storage 
(539 MB vs 16,178 MB), with 1.5× faster indexing (20K docs/s vs BBQ's 13K) and 
4× faster forceMerge (103s vs BBQ's 418s).
   
   ## Early Benchmark data: 5M Cohere Wikipedia, 1024d
   
   From a previous run where I had reliable TQ-1bit numbers (but TQ-4bit/8bit 
were affected by the same int overflow bug at this scale — fixed and re-running 
this one too, will update):
   
   | Method | R@10 | Latency | Docs/s | FMerge (s) | Index MB |
   |--------|------|---------|--------|------------|----------|
   | Float32 | 0.929 | 1.51ms | 13,543 | 2,194 | 20,021 |
   | SQ-4bit | 0.854 | 0.82ms | 19,553 | 1,310 | 22,540 |
   | SQ-4bit+5×rsc | 0.984 | 2.90ms | 19,323 | 1,303 | 22,540 |
   | SQ-8bit | 0.917 | 1.17ms | 15,769 | 1,737 | 24,979 |
   | SQ-8bit+5×rsc | 0.984 | 3.92ms | 15,689 | 1,759 | 24,981 |
   | BBQ-1bit | 0.636 | 0.69ms | 23,384 | 1,170 | 20,743 |
   | BBQ-1bit+5×rsc | 0.945 | 2.37ms | 22,919 | 1,150 | 20,744 |
   | **TQ-1bit** | **0.605** | **0.68ms** | 30,381 | 1,748 | **1,063** |
   | **TQ-1bit+5×rsc** | **0.928** | **2.49ms** | 30,460 | 1,247 | **1,064** |
   | TQ-4bit† | 0.887 | — | 18,075 | — | 2,827 |
   | TQ-4bit+5×rsc† | 0.986 | — | 18,341 | — | 2,802 |
   | TQ-8bit† | 0.941 | — | 1,237 | — | 5,176 |
   
   † Latency/merge omitted — int overflow caused force merge failure and 
multi-segment index. Recall is largely valid.
   
   TQ-1bit+5×rescore (0.928) matches Float32 (0.929) at 19× less storage, with 
2.2× faster indexing (30K docs/s vs Float32's 13.5K). ForceMerge is slower in 
this run (no byte-copy merge yet) — the re-run with byte-copy should improve 
this significantly.
   
   ## Addressing the discussion points
   
   **On LUT scoring performance (@tveasey, @mccullocht):** I handle each bit 
width differently. 1-bit uses weighted popcount via `Integer.bitCount()` (which 
JITs to hardware `popcnt`/NEON `cnt`) — no per-dimension LUT gather, just two 
centroid constants and bit-plane accumulation. 8-bit quantizes both query and 
centroid LUT to int8 for SIMD dot product via the Vector API (`ByteVector` → 
`IntVector` widening multiply-accumulate). 2-bit and 4-bit currently use scalar 
nibble extraction — dedicated SIMD paths are planned. My focus has been mostly 
on 1-bit where TQ's compression advantage is highest.
   
   **On TurboQuantMSE vs TurboQuantProd (@mccullocht):** My Initial tests 
showed QJL correction (TurboQuantProd) was expensive and at least on smaller 
datasets and higher bit widths — even at qjlBits=1024 — did not improve recall 
much. I reverted it for now, but this needs to be explored further, especially 
at 1-bit on larger datasets where the correction might matter more.
   
   **On byte-copy merge (@mikemccand, @mccullocht):** Implemented. All segments 
share the same rotation seed, so same-codec merge copies packed bytes directly 
via `MemorySegment.copy` — no re-quantization. I also found and fixed an int 
overflow bug in the merge path (`new byte[numVecs * bytesPerVec]` overflows at 
5.1GB) — currently validating the fix in the benchmark re-runs.
   
   **On ROC curves (@mikemccand):** The dimension scaling table above includes 
latency alongside recall for each method. I haven't done a full overSample 
sweep yet but plan to.
   
   ## Benchmarks in progress
   
   After fixing the merge overflow bug and implementing byte-copy merge, I'm 
re-running everything to get clean numbers:
   
   | Run | Dataset | What it proves |
   |-----|---------|---------------|
   | **Cohere 5M v5** | 5M × 1024d Wikipedia | Clean TQ-4bit/8bit numbers — v3 
had a merge overflow bug that left these as multi-segment indices with 
unreliable results. Also includes byte-copy merge for faster forceMerge times. |
   | **ASIN 1M v2** | 1M × 4096d product embeddings | High-dimension real-world 
data (Qwen3-8B). The v1 run had TQ-8bit merge overflow — this will give correct 
TQ-8bit numbers at 4096d for the first time. |
   | **Cohere 5M v4** | 5M × 1024d Wikipedia | Same as v5 but without byte-copy 
merge — provides a comparison point for merge optimization impact. |
   
   Will update this thread with full results as they complete. Please feel free 
to ask for any other benchmark/comparison and I can include them in future runs.
   
   ## What I'd like to contribute
   
   1. **1-bit encoding** — not in the current PR (which starts at 2-bit). At 
high dimensions, 1-bit is so effective that higher bit widths may not even be 
necessary — 20-30× compression, popcount-based SIMD scorer, and as shown above 
viable recall with rescoring on multiple datasets that seems to improve with 
more dimensions
   2. **Centroid LUT rescore** — the key to eliminating float32 storage and 
getting true compression
   3. **Byte-copy merge implementation** with overflow fixes (under testing)
   4. **SIMD scorer paths** for 1-bit (popcount) and 8-bit (int8 dot product)
   5. **Large-scale benchmarks** (5M Cohere, 1M ASIN 4096d, multi-dim 
scaling/analysis with MRL to know where TQ overtakes existing quantization in 
Lucene)
   
   Happy to collaborate on merging these into the existing PR or opening a 
companion PR. The implementations are complementary — @xande's has some 
advantages mine doesn't:
   - **3-bit encoding** (I only do 1/2/4/8)
   - **Block-diagonal Hadamard** for non-power-of-2 dims like 768 (I require 
power-of-2 right now)
   - **All 4 similarity functions** (DOT_PRODUCT, COSINE, EUCLIDEAN, MIP — I 
only support DOT_PRODUCT/MIP)
   
   Code: https://github.com/shbhar/lucene/tree/turboquant-v1
   
   ---
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add TurboQuant rotation-based vector quantization codec to sandbox [lucene]

Reply via email to