shbhar commented on PR #15903: URL: https://github.com/apache/lucene/pull/15903#issuecomment-4195152417
Hi all, I also work at Amazon but in Advertising (a different org from @xande's product search) where we also use Lucene heavily and I've been independently iterating on a TurboQuant implementation for the past week (also with Kiro CLI) with various tests and benchmarks but with a different approach from this PR where I focus more on 1 bit TQ & never store fp32 vectors to get the full compression benefits. I have early comparison benchmark data below and more benchmarks are still running (after some bug fixes) and I'll update as more data comes in. Branch: https://github.com/shbhar/lucene/tree/turboquant-v1 Below is a summary of the approach and current results co-authored with claude 4.6 ## Design philosophy: quantized-only storage The implementation stores only quantized data on disk — no float32 vectors alongside. The key insight is that TQ's quantization quality after FWHT rotation is so good especially at higher dimensions that float32 rescoring is unnecessary for most use cases. Instead, I rescore directly from quantized data using centroid lookup tables. For users who do need higher-fidelity rescore, this doesn't require baking float32 into the codec — they can store vectors in a separate field and use Lucene's existing `FloatVectorSimilarityValuesSource` with a rescore query. A TurboQuant-specific `ValuesSource` that rescores from a higher-bitwidth TQ field (e.g., search at 1-bit, rescore at 8-bit) is also straightforward to add. I started with the approach of baking in higher bitwidth TQ vectors for rescoring in the codec, but reverted it after realizing that search+rescore with quantized only vectors is actually viable and the choice is better left to users. The storage impact at 1M × 4096d (Qwen3-8B embeddings): | Method | Index Size | Compression | |--------|-----------|-------------| | Float32 | 15,674 MB | 1× | | BBQ-1bit (quantized + float32) | 16,178 MB | 0.97× (larger!) | | **TQ-1bit (quantized only)** | **539 MB** | **29×** | | **TQ-4bit (quantized only)** | **2,000 MB** | **7.8×** | | **TQ-8bit (quantized only)** | **3,951 MB** | **4×** | This is possible because the centroid LUT rescore operates on the same packed bytes as search — `score = queryNorm × docNorm × Σ centroid[bin[i]] × rotatedQuery[i]`. No float32 vectors needed. ## Dimension scaling: "blessing of dimensionality" 100K MS MARCO passages, Qwen3-8B, MRL-truncated to test lower dimensions. HNSW (M=32, beamWidth=100, topK=10, fanout=50). First, raw quantized recall without rescore — this isolates quantization quality during HNSW graph traversal: | Dim | Float32 | SQ-4bit | SQ-8bit | BBQ-1bit | TQ-1bit | TQ-4bit | TQ-8bit | |-----|---------|---------|---------|----------|---------|---------|---------| | 128 | 0.973 | 0.846 | 0.962 | 0.477 | 0.303 | 0.799 | 0.949 | | 512 | 0.971 | 0.822 | 0.903 | 0.632 | 0.587 | **0.878** | **0.956** | | 1024 | 0.973 | 0.853 | 0.901 | 0.721 | **0.720** | **0.907** | **0.960** | | 4096 | 0.977 | 0.804 | 0.838 | 0.722 | **0.807** | **0.929** | **0.962** | SQ degrades with dimension (0.846→0.804 at 4-bit, 0.962→0.838 at 8-bit) while TQ improves (0.303→0.807 at 1-bit, 0.799→0.929 at 4-bit). At ≥1024d, TQ-1bit already matches BBQ-1bit (0.720 vs 0.721), and TQ-4bit/8bit surpass their SQ counterparts at ≥512d. At 4096d, TQ-1bit (0.807) surpasses SQ-4bit (0.804) without any rescore. With 5× rescore, the pattern holds and latency tells the full story: | Dim | SQ-4bit+rsc | | SQ-8bit+rsc | | BBQ-1bit+rsc | | TQ-1bit+rsc | | TQ-4bit+rsc | | |-----|------|------|------|------|------|------|------|------|------|------| | | R@10 | lat | R@10 | lat | R@10 | lat | R@10 | lat | R@10 | lat | | 128 | 0.994 | 0.70ms | 0.997 | 0.78ms | 0.845 | 0.71ms | 0.621 | 0.74ms | **0.995** | 0.76ms | | 512 | 0.994 | 0.93ms | 0.997 | 1.39ms | 0.934 | 0.79ms | 0.902 | 0.96ms | **0.998** | 1.28ms | | 1024 | 0.996 | 1.29ms | 0.997 | 1.89ms | 0.964 | 0.85ms | **0.962** | 1.00ms | **0.997** | 1.88ms | | 2048 | 0.978 | 1.62ms | 0.989 | 2.87ms | 0.931 | 1.08ms | **0.990** | **1.18ms** | **0.988** | 2.67ms | | 4096 | 0.986 | 2.77ms | 0.991 | 5.06ms | 0.951 | 1.48ms | **0.997** | **1.53ms** | **0.997** | 5.19ms | At 4096d, TQ-1bit+rsc achieves the highest recall (0.997) at the lowest latency (1.53ms) — beating SQ-8bit+rsc (0.991, 5.06ms) on both recall and latency, at 30× less storage. At ≥1024d, TQ-1bit+rsc matches or exceeds BBQ-1bit+rsc, and at ≥2048d it's so strong on every axis (recall, latency, storage) that higher bit widths may not even be necessary depending on the dataset. BBQ-1bit+rsc plateaus at 0.951 because its float32 rescore can't recover from the binary quantization error at high dimensions, while TQ-1bit's centroid LUT rescore continues improving. ## Early Benchmark data: 1M ASIN Vectors, Qwen3-8B, 4096d 1M Amazon product ASINs encoded with Qwen3-Embedding-8B at native 4096 dimensions. 5K real product search queries. HNSW (M=32, beamWidth=200, topK=10, fanout=50, forceMerge to 1 segment). Note: TQ-4bit and TQ-8bit had a int overflow bug during merge path in this run that caused multi-segment indices — latency and forceMerge times for those methods were wrong and are omitted. Recall and index size should be mostly unaffected. Re-running all methods with the fix and more SQ options - will update when done (~12 hours) | Method | R@10 | Lat (ms) | Docs/s | FMerge (s) | Index MB | |--------|------|----------|--------|------------|----------| | Float32 | 0.931 | 0.84 | 5,516 | 386 | 15,674 | | BBQ-1bit | 0.771 | 0.57 | 12,986 | 418 | 16,178 | | BBQ-1bit+5×rsc | 0.977 | 1.49 | 12,847 | 419 | 16,178 | | BBQ-1bit+10×rsc | 0.987 | 2.30 | 13,463 | 432 | 16,178 | | **TQ-1bit** | **0.741** | **0.48** | **19,924** | **103** | **539** | | **TQ-1bit+5×rsc** | **0.968** | **1.56** | **19,966** | **283** | **539** | | **TQ-1bit+10×rsc** | **0.985** | **2.43** | **20,054** | **205** | **539** | | TQ-4bit† | 0.856 | — | 8,386 | — | 2,000 | | TQ-4bit+5×rsc† | 0.975 | — | 8,252 | — | 2,000 | | TQ-8bit† | 0.946 | — | 6,825 | — | 3,951 | | TQ-8bit+5×rsc† | 0.984 | — | 6,680 | — | 3,951 | | SQ-4bit+5×rsc | 0.977 | 2.39 | 9,276 | 509 | 17,642 | † Latency/merge omitted — int overflow caused multi-segment index. Recall is largely valid. TQ-1bit+10×rsc (0.985) matches BBQ-1bit+10×rsc (0.987) at 30× less storage (539 MB vs 16,178 MB), with 1.5× faster indexing (20K docs/s vs BBQ's 13K) and 4× faster forceMerge (103s vs BBQ's 418s). ## Early Benchmark data: 5M Cohere Wikipedia, 1024d From a previous run where I had reliable TQ-1bit numbers (but TQ-4bit/8bit were affected by the same int overflow bug at this scale — fixed and re-running this one too, will update): | Method | R@10 | Latency | Docs/s | FMerge (s) | Index MB | |--------|------|---------|--------|------------|----------| | Float32 | 0.929 | 1.51ms | 13,543 | 2,194 | 20,021 | | SQ-4bit | 0.854 | 0.82ms | 19,553 | 1,310 | 22,540 | | SQ-4bit+5×rsc | 0.984 | 2.90ms | 19,323 | 1,303 | 22,540 | | SQ-8bit | 0.917 | 1.17ms | 15,769 | 1,737 | 24,979 | | SQ-8bit+5×rsc | 0.984 | 3.92ms | 15,689 | 1,759 | 24,981 | | BBQ-1bit | 0.636 | 0.69ms | 23,384 | 1,170 | 20,743 | | BBQ-1bit+5×rsc | 0.945 | 2.37ms | 22,919 | 1,150 | 20,744 | | **TQ-1bit** | **0.605** | **0.68ms** | 30,381 | 1,748 | **1,063** | | **TQ-1bit+5×rsc** | **0.928** | **2.49ms** | 30,460 | 1,247 | **1,064** | | TQ-4bit† | 0.887 | — | 18,075 | — | 2,827 | | TQ-4bit+5×rsc† | 0.986 | — | 18,341 | — | 2,802 | | TQ-8bit† | 0.941 | — | 1,237 | — | 5,176 | † Latency/merge omitted — int overflow caused force merge failure and multi-segment index. Recall is largely valid. TQ-1bit+5×rescore (0.928) matches Float32 (0.929) at 19× less storage, with 2.2× faster indexing (30K docs/s vs Float32's 13.5K). ForceMerge is slower in this run (no byte-copy merge yet) — the re-run with byte-copy should improve this significantly. ## Addressing the discussion points **On LUT scoring performance (@tveasey, @mccullocht):** I handle each bit width differently. 1-bit uses weighted popcount via `Integer.bitCount()` (which JITs to hardware `popcnt`/NEON `cnt`) — no per-dimension LUT gather, just two centroid constants and bit-plane accumulation. 8-bit quantizes both query and centroid LUT to int8 for SIMD dot product via the Vector API (`ByteVector` → `IntVector` widening multiply-accumulate). 2-bit and 4-bit currently use scalar nibble extraction — dedicated SIMD paths are planned. My focus has been mostly on 1-bit where TQ's compression advantage is highest. **On TurboQuantMSE vs TurboQuantProd (@mccullocht):** My Initial tests showed QJL correction (TurboQuantProd) was expensive and at least on smaller datasets and higher bit widths — even at qjlBits=1024 — did not improve recall much. I reverted it for now, but this needs to be explored further, especially at 1-bit on larger datasets where the correction might matter more. **On byte-copy merge (@mikemccand, @mccullocht):** Implemented. All segments share the same rotation seed, so same-codec merge copies packed bytes directly via `MemorySegment.copy` — no re-quantization. I also found and fixed an int overflow bug in the merge path (`new byte[numVecs * bytesPerVec]` overflows at 5.1GB) — currently validating the fix in the benchmark re-runs. **On ROC curves (@mikemccand):** The dimension scaling table above includes latency alongside recall for each method. I haven't done a full overSample sweep yet but plan to. ## Benchmarks in progress After fixing the merge overflow bug and implementing byte-copy merge, I'm re-running everything to get clean numbers: | Run | Dataset | What it proves | |-----|---------|---------------| | **Cohere 5M v5** | 5M × 1024d Wikipedia | Clean TQ-4bit/8bit numbers — v3 had a merge overflow bug that left these as multi-segment indices with unreliable results. Also includes byte-copy merge for faster forceMerge times. | | **ASIN 1M v2** | 1M × 4096d product embeddings | High-dimension real-world data (Qwen3-8B). The v1 run had TQ-8bit merge overflow — this will give correct TQ-8bit numbers at 4096d for the first time. | | **Cohere 5M v4** | 5M × 1024d Wikipedia | Same as v5 but without byte-copy merge — provides a comparison point for merge optimization impact. | Will update this thread with full results as they complete. Please feel free to ask for any other benchmark/comparison and I can include them in future runs. ## What I'd like to contribute 1. **1-bit encoding** — not in the current PR (which starts at 2-bit). At high dimensions, 1-bit is so effective that higher bit widths may not even be necessary — 20-30× compression, popcount-based SIMD scorer, and as shown above viable recall with rescoring on multiple datasets that seems to improve with more dimensions 2. **Centroid LUT rescore** — the key to eliminating float32 storage and getting true compression 3. **Byte-copy merge implementation** with overflow fixes (under testing) 4. **SIMD scorer paths** for 1-bit (popcount) and 8-bit (int8 dot product) 5. **Large-scale benchmarks** (5M Cohere, 1M ASIN 4096d, multi-dim scaling/analysis with MRL to know where TQ overtakes existing quantization in Lucene) Happy to collaborate on merging these into the existing PR or opening a companion PR. The implementations are complementary — @xande's has some advantages mine doesn't: - **3-bit encoding** (I only do 1/2/4/8) - **Block-diagonal Hadamard** for non-power-of-2 dims like 768 (I require power-of-2 right now) - **All 4 similarity functions** (DOT_PRODUCT, COSINE, EUCLIDEAN, MIP — I only support DOT_PRODUCT/MIP) Code: https://github.com/shbhar/lucene/tree/turboquant-v1 --- -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
