shbhar commented on PR #15903: URL: https://github.com/apache/lucene/pull/15903#issuecomment-4200938919
Here are the updated results after some fixes - TQ8Bit is also comparable to SQ8bit now, but TQ4bit is still slower than SQ4bit (but at comparable recall and much smaller index size).

Changes from last run:
1. Merge int overflow fix
2. Byte copy merge
3. int8 SIMD scorer for TQ8Bit

## Benchmark data: 5M Cohere Wikipedia, 1024d

5M Cohere Wikipedia vectors at 1024 dimensions. HNSW (M=32, beamWidth=100, topK=10, fanout=50, forceMerge to 1 segment).

| Method | R@10 | Latency | Docs/s | FMerge (s) | Index MB |
|--------|------|---------|--------|------------|----------|
| Float32 | 0.928 | 1.60ms | 11,881 | 2,267 | 20,020 |
| SQ-4bit | 0.855 | 0.86ms | 19,025 | 1,347 | 22,538 |
| SQ-4bit+5×rsc | 0.986 | 3.17ms | 19,133 | 1,361 | 22,539 |
| SQ-8bit | 0.918 | 1.23ms | 15,784 | 1,791 | 24,980 |
| SQ-8bit+5×rsc | 0.987 | 4.13ms | 15,318 | 1,776 | 24,979 |
| BBQ-1bit | 0.631 | 0.82ms | 22,840 | 1,208 | 20,743 |
| BBQ-1bit+5×rsc | 0.944 | 2.59ms | 23,257 | 1,215 | 20,744 |
| **TQ-1bit** | **0.608** | **0.77ms** | 30,381 | 1,897 | **1,064** |
| **TQ-1bit+5×rsc** | **0.928** | **2.83ms** | 30,405 | 1,532 | **1,066** |
| **TQ-4bit** | **0.852** | **1.19ms** | 17,915 | 2,459 | **2,851** |
| **TQ-4bit+5×rsc** | **0.983** | **3.91ms** | 17,195 | 3,858 | **2,849** |
| **TQ-8bit** | **0.902** | **0.94ms** | **19,369** | **2,659** | **5,293** |
| **TQ-8bit+5×rsc** | **0.983** | **3.13ms** | **18,471** | **3,110** | **5,293** |

TQ-1bit vs BBQ-1bit: TQ-1bit (0.608) nearly matches BBQ-1bit (0.631) raw recall, but at 19× less storage (1,064 MB vs 20,743 MB) and 1.3× faster indexing (30K vs 23K docs/s). With 5× rescore, TQ-1bit+rsc (0.928) nearly matches BBQ-1bit+rsc (0.944); the gap narrows further at higher dimensions (see ASIN 4096d below).

TQ-8bit vs SQ-8bit: TQ-8bit (0.902) nearly matches SQ-8bit (0.918) raw recall at 0.94ms vs 1.23ms latency (1.3× faster), with 4.7× less storage (5,293 MB vs 24,980 MB).
With 5× rescore, TQ-8bit+rsc (0.983) nearly matches SQ-8bit+rsc (0.987) at 24% less latency (3.13ms vs 4.13ms).

## Benchmark data: 1M ASIN Vectors, Qwen3-8B, 4096d

1M Amazon product ASINs encoded with Qwen3-Embedding-8B at native 4096 dimensions. 5K real product search queries. HNSW (M=32, beamWidth=200, topK=10, fanout=50, forceMerge to 1 segment).

| Method | R@10 | Lat (ms) | Docs/s | FMerge (s) | Index MB |
|--------|------|----------|--------|------------|----------|
| Float32 | 0.925 | 0.85 | 5,430 | 389 | 15,674 |
| SQ-4bit | 0.883 | 0.84 | 9,287 | 504 | 17,642 |
| SQ-4bit+5×rsc | 0.978 | 2.47 | 9,049 | 512 | 17,642 |
| SQ-8bit | 0.902 | 1.31 | 6,752 | 680 | 19,595 |
| SQ-8bit+5×rsc | 0.980 | 3.91 | 6,947 | 672 | 19,595 |
| BBQ-1bit | 0.774 | 0.56 | 13,200 | 417 | 16,178 |
| BBQ-1bit+5×rsc | 0.976 | 1.53 | 13,235 | 419 | 16,178 |
| BBQ-1bit+10×rsc | 0.987 | 2.26 | 13,120 | 422 | 16,178 |
| **TQ-1bit** | **0.741** | **0.49** | **20,020** | **210** | **539** |
| **TQ-1bit+5×rsc** | **0.970** | **1.58** | **19,376** | **353** | **539** |
| **TQ-1bit+10×rsc** | **0.984** | **2.47** | **19,460** | **352** | **538** |
| **TQ-4bit** | **0.866** | **1.33** | **8,226** | **1,397** | **2,000** |
| **TQ-4bit+5×rsc** | **0.974** | **4.18** | **8,181** | **1,409** | **2,000** |
| **TQ-8bit** | **0.908** | **0.94** | **10,537** | **667** | **3,954** |
| **TQ-8bit+5×rsc** | **0.974** | **2.89** | **10,564** | **724** | **3,954** |

Without rescore, TQ-8bit beats SQ-8bit on every axis at 4096d: higher recall (0.908 vs 0.902), lower latency (0.94ms vs 1.31ms), faster indexing (10.5K vs 6.8K docs/s), comparable merge time (667s vs 680s), and 5× smaller index (3,954 MB vs 19,595 MB).

TQ-1bit+10×rsc (0.984) nearly matches BBQ-1bit+10×rsc (0.987) at 30× less storage (538 MB vs 16,178 MB), with 1.5× faster indexing and up to 2× faster merge (210s vs 417s in the no-rescore runs; 352s vs 422s in the 10× rescore runs).
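For anyone who wants to sanity-check the headline multipliers quoted above, here is a minimal sketch that recomputes them from the table rows. The numbers are copied from the two tables; the dictionary layout and the `headline_ratios` helper are only illustrative and are not part of the PR or luceneutil.

```python
# Figures copied from the benchmark tables above. The dict layout and the
# helper name are hypothetical, chosen only for this arithmetic check.
COHERE_1024D = {
    "SQ-8bit":       {"lat_ms": 1.23, "docs_s": 15784, "merge_s": 1791, "index_mb": 24980},
    "SQ-8bit+5xrsc": {"lat_ms": 4.13, "docs_s": 15318, "merge_s": 1776, "index_mb": 24979},
    "BBQ-1bit":      {"lat_ms": 0.82, "docs_s": 22840, "merge_s": 1208, "index_mb": 20743},
    "TQ-1bit":       {"lat_ms": 0.77, "docs_s": 30381, "merge_s": 1897, "index_mb": 1064},
    "TQ-8bit":       {"lat_ms": 0.94, "docs_s": 19369, "merge_s": 2659, "index_mb": 5293},
    "TQ-8bit+5xrsc": {"lat_ms": 3.13, "docs_s": 18471, "merge_s": 3110, "index_mb": 5293},
}

ASIN_4096D = {
    "SQ-8bit":         {"lat_ms": 1.31, "docs_s": 6752,  "merge_s": 680, "index_mb": 19595},
    "BBQ-1bit":        {"lat_ms": 0.56, "docs_s": 13200, "merge_s": 417, "index_mb": 16178},
    "BBQ-1bit+10xrsc": {"lat_ms": 2.26, "docs_s": 13120, "merge_s": 422, "index_mb": 16178},
    "TQ-1bit":         {"lat_ms": 0.49, "docs_s": 20020, "merge_s": 210, "index_mb": 539},
    "TQ-1bit+10xrsc":  {"lat_ms": 2.47, "docs_s": 19460, "merge_s": 352, "index_mb": 538},
    "TQ-8bit":         {"lat_ms": 0.94, "docs_s": 10537, "merge_s": 667, "index_mb": 3954},
}


def headline_ratios():
    """Recompute the multipliers quoted in the prose from the table rows."""
    c, a = COHERE_1024D, ASIN_4096D
    return {
        # 1024d Cohere comparisons
        "cohere_tq1_vs_bbq1_storage": c["BBQ-1bit"]["index_mb"] / c["TQ-1bit"]["index_mb"],
        "cohere_tq1_vs_bbq1_indexing": c["TQ-1bit"]["docs_s"] / c["BBQ-1bit"]["docs_s"],
        "cohere_tq8_vs_sq8_latency": c["SQ-8bit"]["lat_ms"] / c["TQ-8bit"]["lat_ms"],
        "cohere_tq8_vs_sq8_storage": c["SQ-8bit"]["index_mb"] / c["TQ-8bit"]["index_mb"],
        # Fractional latency reduction with 5x rescore (0.24 means 24% less)
        "cohere_tq8_rsc_latency_cut": 1 - c["TQ-8bit+5xrsc"]["lat_ms"] / c["SQ-8bit+5xrsc"]["lat_ms"],
        # 4096d ASIN comparisons
        "asin_tq8_vs_sq8_storage": a["SQ-8bit"]["index_mb"] / a["TQ-8bit"]["index_mb"],
        "asin_tq1_vs_bbq1_storage_10x": a["BBQ-1bit+10xrsc"]["index_mb"] / a["TQ-1bit+10xrsc"]["index_mb"],
        "asin_tq1_vs_bbq1_indexing_10x": a["TQ-1bit+10xrsc"]["docs_s"] / a["BBQ-1bit+10xrsc"]["docs_s"],
        "asin_tq1_vs_bbq1_merge_raw": a["BBQ-1bit"]["merge_s"] / a["TQ-1bit"]["merge_s"],
    }


if __name__ == "__main__":
    for name, value in headline_ratios().items():
        print(f"{name}: {value:.2f}")
```

Each ratio is written as baseline over TQ (or TQ over baseline for throughput), so a value above 1 always favors TQ.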
If anyone wants to replicate these results:

lucene: https://github.com/shbhar/lucene/tree/turboquant-v1 (commit for these tests: 62cce045b61556484517542e43e6c0c7ddfec8ee)

luceneutil: https://github.com/shbhar/luceneutil/tree/turboquant-v1 (hacky - make sure to run fp32 as the first one as tq ground truth depends on that index) - commit for this test: 911b947dab95a6164ba38c875eca5a1d72298b3c

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
