kaivalnp commented on PR #15979: URL: https://github.com/apache/lucene/pull/15979#issuecomment-4349710634
I observed a benchmark issue in the `luceneutil` KNN benchmark where `force_merge(s)` was noisy -- the [indexing flow](https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/main/knn/KnnIndexer.java#L63) in that script is split into three phases: 1. initial indexing 2. wait for running merges to finish 3. force merge The `index(s)` column reports (1); `force_merge(s)` reports (3); but (2) is untracked! Merges are selected by Lucene, and one run can do more / less work than another -- leading to un-comparable `force_merge(s)`? To work around this, I simply moved [this line](https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/main/knn/KnnIndexer.java#L257) to after `waitForMergesWithStatus` to include both (1) and (2) in `index(s)`. Can we look at the sum of `index(s)` and `force_merge(s)` as a "total indexing time" proxy? cc @mikemccand --- #### Benchmarks I ran a `luceneutil` KNN benchmark with Cohere v3 vectors, 1024d, `DOT_PRODUCT` similarity, 1M documents, 10K queries, no quantization. To benchmark this PR, we need to replace the format [here](https://github.com/apache/lucene/blob/23d400cabd8e63d26d594ec94e866ef5e22d2a12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java#L153) with `DedupFlatVectorsFormat`. **Note**: Before the work-around listed above, `index(s)` was very close in `main` and this PR, but the untracked running merges were costlier in this PR, leading to artificially lower `force_merge(s)` (as visible below). The numbers below _do include_ the work-around that considers initial indexing + wait for running merges in `index(s)`. For this benchmark I'm approximating `index(s)` + `force_merge(s)` as "total indexing time" for comparison. `main` ``` recall latency(ms) netCPU avgCpuCount visited index(s) index_docs/s force_merge(s) index_size(MB) filterStrategy filterSelectivity 0.973 6.065 6.058 0.999 16735 483.72 2067.31 178.37 4055.40 query-time-pre-filter 0.50 0.968 5.805 5.804 1.000 16195 483.72 2067.31 178.37 4055.40 query-time-pre-filter 0.20 0.971 6.120 6.119 1.000 12687 483.72 2067.31 178.37 4055.40 query-time-pre-filter 0.10 0.988 3.103 3.101 0.999 8118 552.98 1808.40 392.23 6077.95 index-time-filter 0.50 0.991 2.468 2.467 1.000 7412 485.77 2058.59 276.12 4862.36 index-time-filter 0.20 0.993 2.218 2.217 1.000 6818 428.49 2333.77 281.16 4457.81 index-time-filter 0.10 ``` This PR ``` recall latency(ms) netCPU avgCpuCount visited index(s) index_docs/s force_merge(s) index_size(MB) filterStrategy filterSelectivity 0.973 6.240 6.238 1.000 16709 592.26 1688.44 80.70 4058.20 query-time-pre-filter 0.50 0.968 6.094 6.092 1.000 16102 592.26 1688.44 80.70 4058.20 query-time-pre-filter 0.20 0.971 6.369 6.367 1.000 12611 592.26 1688.44 80.70 4058.20 query-time-pre-filter 0.10 0.988 2.815 2.811 0.999 8125 775.38 1289.70 239.68 4130.06 index-time-filter 0.50 0.991 2.161 2.160 0.999 7413 441.66 2264.17 291.74 4085.13 index-time-filter 0.20 0.993 2.417 2.416 1.000 6827 392.37 2548.60 315.08 4070.74 index-time-filter 0.10 ``` #### Summary 1. Sanity check: `recall` is exactly the same on `main` and this PR. 2. `latency(ms)` is in the same range too: some runs are higher / lower, but overall appears to be in the range of HNSW noise? 3. As a recap of the options used: 1. `filterStrategy = query-time-pre-filter` with `filterSelectivity = F` applies a filter during graph traversal that matches `F` proportion of docs. **Note** that the reported latency does NOT include time spent in `BitSet` creation for the filter; only reports graph search time. 2. `filterStrategy = index-time-filter` (added in https://github.com/mikemccand/luceneutil/pull/468) with `filterSelectivity = F` indexes the same docs (`F` proportion) into a separate vector field, which is then used at search time. There is no separate query-time cost, apart from the user resolving to the correct field based on the filter themselves. 3. The difference b/w `query-time-pre-filter` and `index-time-filter` is \~2-3x, and demonstrates the tradeoffs nicely: more work during indexing to create a separate HNSW graph, for faster filtered search. 4. "Total indexing time" (`index(s) + force_merge(s)`) change: 1. `662.09 s` -> `672.96 s` (+1.6%) for `query-time-pre-filter` (i.e. no explicit duplicate vector field). 2. `945.21 s` -> `1015.06 s` (+7.4%) for `index-time-filter` with `50%` docs in the smaller field. 3. `761.89 s` -> `733.4 s` (-3.7%) for `index-time-filter` with `20%` docs in the smaller field. 4. `709.65 s` -> `707.45 s` (-0.3%) for `index-time-filter` with `10%` docs in the smaller field. 5. All-in-all, there is <10% difference with this PR (slower in most cases, occasionally faster, likely due to indexing non-determinism). 5. There is a sharp reduction in `index_size(MB)` with this PR as expected: only unique vectors are stored on disk (e.g. \~4GB index size without separate vector field -> \~6GB with 50% duplicates, which is reclaimed with this PR -- main additional disk usage is for the HNSW graph). As a next step, I'll increase the duplication factor: up to TK fields, each containing a (different) proportion `P` of docs of the main field (to simulate a practical scenario of the proposal in https://github.com/apache/lucene/issues/14758). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
