yx-keith opened a new pull request, #64145:
URL: https://github.com/apache/doris/pull/64145

   ### What problem does this PR solve?
   
   Issue Number: close #xxx
   
   Related PR: #xxx
   
   Problem Summary:
   
   When building an ANN (vector) index, `AnnIndexColumnWriter` buffers incoming
   vectors and flushes them to the underlying FAISS index one chunk at a time
   (`ann_index_build_chunk_size`, default **1,000,000** rows). On every chunk
   flush it calls `train()` **and then** `add()`. Re-invoking `train()` once per
   chunk is incorrect:
   
   - **IVF / HNSW with a `SQ`/`PQ` quantizer (recall corruption).**
     `train()` re-fits the scalar/product-quantization codebook on whatever 
chunk
     is currently buffered. In FAISS, `IndexIVF::train()` calls 
`train_encoder()`
     with **no `is_trained` guard** 
(`contrib/faiss/faiss/IndexIVF.cpp:1166/1168`),
     and `IndexHNSWSQ`/`IndexHNSWPQ` start `is_trained=false`. Vectors from 
earlier
     chunks were already `add()`ed and **encoded using the previous codebook**.
     After the final chunk re-trains the codebook, those earlier codes no longer
     match the codebook used at query time, so they decode to the wrong 
distances.
     The result is **silent recall loss on any segment larger than one chunk**
     (> 1M rows, e.g. a compacted segment).
   
   - **Default IVFFlat / HNSW-Flat (redundant work).**
     For the default `FLAT` quantizer there is no codebook, and the coarse
     quantizer's k-means is already guarded — `Level1Quantizer::train_q1()` 
returns
     early once `quantizer->is_trained && quantizer->ntotal == nlist`
     (`contrib/faiss/faiss/IndexIVF.cpp:62`). So recall is **not** corrupted, 
but
     every extra chunk still pays a full `train()` call, including its global
     OpenMP build-budget mutex. Pure wasted work.
   
   The writer's own `finish()` comment already documented the intended contract
   ("the quantizer was already trained on previous batches, just add"), but the
   per-chunk loop never honored it.
   
   ### What's changed
   
   Train the index **at most once** per build:
   
   - Add `VectorIndex::needs_training()` (default `false`). `FaissVectorIndex`
     overrides it to return `true` for `IVF` / `IVF_ON_DISK`, or for any 
quantizer
     other than `FLAT` (i.e. `SQ4`/`SQ8`/`PQ`). HNSW + FLAT returns `false`.
   - Route both `train()` call sites in `AnnIndexColumnWriter` (the chunk-flush
     path and the `finish()` remainder path) through a single
     `_train_once_if_needed()` helper, guarded by a `_trained` flag. The first
     flush trains (when the index needs it) and latches the flag; every later 
flush
     only `add()`s.
   
   ### Behavioral impact
   
   - **IVFFlat / HNSW-Flat**: byte-for-byte identical index. The 
coarse-quantizer
     centroids were already taken from the first chunk (the `train_q1` guard
     prevented later chunks from overwriting them), so the produced index is the
     same — we just stop calling a no-op `train()` repeatedly. **No recall or
     format change.**
   - **IVF-SQ / IVF-PQ / HNSW-SQ / HNSW-PQ**: the codebook is now trained once 
(on
     the first chunk) and stays consistent with every code that gets written.
     **This fixes the recall corruption** for multi-chunk segments.
   - **Build performance**: strictly faster or equal — we remove `N-1` redundant
     `train()` calls per segment (and their OMP-budget mutex acquisitions). No 
path
     is made slower.
   - Applies uniformly to all build paths that go through 
`AnnIndexColumnWriter`:
     normal load/flush, compaction, schema change, and 
build-index-on-existing-data.
   
   ### Notes / scope
   
   - This PR is a **pure correctness + redundancy fix**. It does **not** change
     default index parameters, the on-disk format, the search path, or memory
     admission. Reducing the training-set staging memory (sampling instead of
     buffering the whole chunk) is intentionally left to a follow-up.
   - `FaissVectorIndex` is the only production `VectorIndex` subclass, and
     `AnnIndexColumnWriter` is the only code that drives `train()` during a 
build,
     so the new virtual method cannot affect any other type or path.
   
   ## Release note
   
   Fixed a bug where building an ANN index on a segment larger than
   `ann_index_build_chunk_size` (default 1,000,000 rows) with an `SQ`/`PQ`
   quantizer could silently lose recall, because the quantizer codebook was
   re-trained on every chunk. The index is now trained exactly once per build.
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to