Re: [PR] Add a de-duplicating flat vector format [lucene]

via GitHub Wed, 29 Apr 2026 21:34:36 -0700


kaivalnp commented on PR #15979:
URL: https://github.com/apache/lucene/pull/15979#issuecomment-4349710634


   I observed a benchmark issue in the `luceneutil` KNN benchmark where 
`force_merge(s)` was noisy -- the [indexing 
flow](https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/main/knn/KnnIndexer.java#L63)
 in that script is split into three phases:
   1. initial indexing
   2. wait for running merges to finish
   3. force merge
   
   The `index(s)` column reports (1); `force_merge(s)` reports (3); but (2) is 
untracked!
   Merges are selected by Lucene, and one run can do more / less work than 
another -- leading to un-comparable `force_merge(s)`?
   
   To work around this, I simply moved [this 
line](https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/main/knn/KnnIndexer.java#L257)
 to after `waitForMergesWithStatus` to include both (1) and (2) in `index(s)`.
   
   Can we look at the sum of `index(s)` and `force_merge(s)` as a "total 
indexing time" proxy?
   
   cc @mikemccand
   
   ---
   
   #### Benchmarks
   
   I ran a `luceneutil` KNN benchmark with Cohere v3 vectors, 1024d, 
`DOT_PRODUCT` similarity, 1M documents, 10K queries, no quantization.
   
   To benchmark this PR, we need to replace the format 
[here](https://github.com/apache/lucene/blob/23d400cabd8e63d26d594ec94e866ef5e22d2a12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java#L153)
 with `DedupFlatVectorsFormat`.
   
   **Note**: Before the work-around listed above, `index(s)` was very close in 
`main` and this PR, but the untracked running merges were costlier in this PR, 
leading to artificially lower `force_merge(s)` (as visible below).
   
   The numbers below _do include_ the work-around that considers initial 
indexing + wait for running merges in `index(s)`. For this benchmark I'm 
approximating `index(s)` + `force_merge(s)` as "total indexing time" for 
comparison.
   
   `main`
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount  visited  index(s)  index_docs/s  
force_merge(s)  index_size(MB)         filterStrategy  filterSelectivity
    0.973        6.065   6.058        0.999    16735    483.72       2067.31    
      178.37         4055.40  query-time-pre-filter               0.50
    0.968        5.805   5.804        1.000    16195    483.72       2067.31    
      178.37         4055.40  query-time-pre-filter               0.20
    0.971        6.120   6.119        1.000    12687    483.72       2067.31    
      178.37         4055.40  query-time-pre-filter               0.10
    0.988        3.103   3.101        0.999     8118    552.98       1808.40    
      392.23         6077.95      index-time-filter               0.50
    0.991        2.468   2.467        1.000     7412    485.77       2058.59    
      276.12         4862.36      index-time-filter               0.20
    0.993        2.218   2.217        1.000     6818    428.49       2333.77    
      281.16         4457.81      index-time-filter               0.10
   ```
   
   This PR
   
   ```
   recall  latency(ms)  netCPU  avgCpuCount  visited  index(s)  index_docs/s  
force_merge(s)  index_size(MB)         filterStrategy  filterSelectivity
    0.973        6.240   6.238        1.000    16709    592.26       1688.44    
       80.70         4058.20  query-time-pre-filter               0.50
    0.968        6.094   6.092        1.000    16102    592.26       1688.44    
       80.70         4058.20  query-time-pre-filter               0.20
    0.971        6.369   6.367        1.000    12611    592.26       1688.44    
       80.70         4058.20  query-time-pre-filter               0.10
    0.988        2.815   2.811        0.999     8125    775.38       1289.70    
      239.68         4130.06      index-time-filter               0.50
    0.991        2.161   2.160        0.999     7413    441.66       2264.17    
      291.74         4085.13      index-time-filter               0.20
    0.993        2.417   2.416        1.000     6827    392.37       2548.60    
      315.08         4070.74      index-time-filter               0.10
   ```
   
   #### Summary
   
   1. Sanity check: `recall` is exactly the same on `main` and this PR.
   2. `latency(ms)` is in the same range too: some runs are higher / lower, but 
overall appears to be in the range of HNSW noise?
   3. As a recap of the options used:
        1. `filterStrategy = query-time-pre-filter` with `filterSelectivity = 
F` applies a filter during graph traversal that matches `F` proportion of docs. 
**Note** that the reported latency does NOT include time spent in `BitSet` 
creation for the filter; only reports graph search time.
        2. `filterStrategy = index-time-filter` (added in 
https://github.com/mikemccand/luceneutil/pull/468) with `filterSelectivity = F` 
indexes the same docs (`F` proportion) into a separate vector field, which is 
then used at search time. There is no separate query-time cost, apart from the 
user resolving to the correct field based on the filter themselves.
        3. The difference b/w `query-time-pre-filter` and `index-time-filter` 
is \~2-3x, and demonstrates the tradeoffs nicely: more work during indexing to 
create a separate HNSW graph, for faster filtered search.
   4. "Total indexing time" (`index(s) + force_merge(s)`) change:
        1. `662.09 s` -> `672.96 s` (+1.6%) for `query-time-pre-filter` (i.e. 
no explicit duplicate vector field).
        2. `945.21 s` -> `1015.06 s` (+7.4%) for `index-time-filter` with `50%` 
docs in the smaller field.
        3. `761.89 s` -> `733.4 s` (-3.7%) for `index-time-filter` with `20%` 
docs in the smaller field.
        4. `709.65 s` -> `707.45 s` (-0.3%) for `index-time-filter` with `10%` 
docs in the smaller field.
        5. All-in-all, there is <10% difference with this PR (slower in most 
cases, occasionally faster, likely due to indexing non-determinism).
   5. There is a sharp reduction in `index_size(MB)` with this PR as expected: 
only unique vectors are stored on disk (e.g. \~4GB index size without separate 
vector field -> \~6GB with 50% duplicates, which is reclaimed with this PR -- 
main additional disk usage is for the HNSW graph).
   
   As a next step, I'll increase the duplication factor: up to TK fields, each 
containing a (different) proportion `P` of docs of the main field (to simulate 
a practical scenario of the proposal in 
https://github.com/apache/lucene/issues/14758).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add a de-duplicating flat vector format [lucene]

Reply via email to