yx-keith opened a new pull request, #64145:
URL: https://github.com/apache/doris/pull/64145
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
When building an ANN (vector) index, `AnnIndexColumnWriter` buffers incoming
vectors and flushes them to the underlying FAISS index one chunk at a time
(`ann_index_build_chunk_size`, default **1,000,000** rows). On every chunk
flush it calls `train()` **and then** `add()`. Re-invoking `train()` once per
chunk is incorrect:
- **IVF / HNSW with a `SQ`/`PQ` quantizer (recall corruption).**
`train()` re-fits the scalar/product-quantization codebook on whatever
chunk
is currently buffered. In FAISS, `IndexIVF::train()` calls
`train_encoder()`
with **no `is_trained` guard**
(`contrib/faiss/faiss/IndexIVF.cpp:1166/1168`),
and `IndexHNSWSQ`/`IndexHNSWPQ` start `is_trained=false`. Vectors from
earlier
chunks were already `add()`ed and **encoded using the previous codebook**.
After the final chunk re-trains the codebook, those earlier codes no longer
match the codebook used at query time, so they decode to the wrong
distances.
The result is **silent recall loss on any segment larger than one chunk**
(> 1M rows, e.g. a compacted segment).
- **Default IVFFlat / HNSW-Flat (redundant work).**
For the default `FLAT` quantizer there is no codebook, and the coarse
quantizer's k-means is already guarded — `Level1Quantizer::train_q1()`
returns
early once `quantizer->is_trained && quantizer->ntotal == nlist`
(`contrib/faiss/faiss/IndexIVF.cpp:62`). So recall is **not** corrupted,
but
every extra chunk still pays a full `train()` call, including its global
OpenMP build-budget mutex. Pure wasted work.
The writer's own `finish()` comment already documented the intended contract
("the quantizer was already trained on previous batches, just add"), but the
per-chunk loop never honored it.
### What's changed
Train the index **at most once** per build:
- Add `VectorIndex::needs_training()` (default `false`). `FaissVectorIndex`
overrides it to return `true` for `IVF` / `IVF_ON_DISK`, or for any
quantizer
other than `FLAT` (i.e. `SQ4`/`SQ8`/`PQ`). HNSW + FLAT returns `false`.
- Route both `train()` call sites in `AnnIndexColumnWriter` (the chunk-flush
path and the `finish()` remainder path) through a single
`_train_once_if_needed()` helper, guarded by a `_trained` flag. The first
flush trains (when the index needs it) and latches the flag; every later
flush
only `add()`s.
### Behavioral impact
- **IVFFlat / HNSW-Flat**: byte-for-byte identical index. The
coarse-quantizer
centroids were already taken from the first chunk (the `train_q1` guard
prevented later chunks from overwriting them), so the produced index is the
same — we just stop calling a no-op `train()` repeatedly. **No recall or
format change.**
- **IVF-SQ / IVF-PQ / HNSW-SQ / HNSW-PQ**: the codebook is now trained once
(on
the first chunk) and stays consistent with every code that gets written.
**This fixes the recall corruption** for multi-chunk segments.
- **Build performance**: strictly faster or equal — we remove `N-1` redundant
`train()` calls per segment (and their OMP-budget mutex acquisitions). No
path
is made slower.
- Applies uniformly to all build paths that go through
`AnnIndexColumnWriter`:
normal load/flush, compaction, schema change, and
build-index-on-existing-data.
### Notes / scope
- This PR is a **pure correctness + redundancy fix**. It does **not** change
default index parameters, the on-disk format, the search path, or memory
admission. Reducing the training-set staging memory (sampling instead of
buffering the whole chunk) is intentionally left to a follow-up.
- `FaissVectorIndex` is the only production `VectorIndex` subclass, and
`AnnIndexColumnWriter` is the only code that drives `train()` during a
build,
so the new virtual method cannot affect any other type or path.
## Release note
Fixed a bug where building an ANN index on a segment larger than
`ann_index_build_chunk_size` (default 1,000,000 rows) with an `SQ`/`PQ`
quantizer could silently lose recall, because the quantizer codebook was
re-trained on every chunk. The index is now trained exactly once per build.
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should
merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]