github-actions[bot] commented on code in PR #59847:
URL: https://github.com/apache/doris/pull/59847#discussion_r2943855793
##########
be/src/storage/index/inverted/query_v2/weight.h:
##########
@@ -63,6 +66,34 @@ class Weight {
return scorer(context);
Review Comment:
**[Design Concern] Mutual recursion between `scorer` overloads**
`scorer(ctx)` calls `scorer(ctx, {})` which calls `scorer(ctx)` — infinite
recursion if a subclass fails to override at least one. Currently all concrete
subclasses break the cycle, but this is fragile with no compile-time safety net.
Consider either:
- Making both pure virtual, or
- Having both delegate to a single protected pure-virtual `do_scorer(ctx,
key)` method
This would turn a runtime stack overflow into a compile-time error for new
Weight subclasses.
##########
be/src/storage/index/inverted/query_v2/term_query/term_weight.h:
##########
@@ -36,25 +43,43 @@ class TermWeight : public Weight {
~TermWeight() override = default;
ScorerPtr scorer(const QueryExecutionContext& ctx, const std::string&
binding_key) override {
+ auto result = specialized_scorer(ctx, binding_key);
+ return std::visit([](auto&& sc) -> ScorerPtr { return sc; }, result);
+ }
+
+ template <typename Callback>
+ void for_each_pruning(const QueryExecutionContext& context, const
std::string& binding_key,
+ float threshold, Callback&& callback) {
+ auto result = specialized_scorer(context, binding_key);
Review Comment:
**[Design Concern] Template `for_each_pruning` unreachable via virtual
dispatch**
This template `for_each_pruning` with `Callback&&` is **not** an override of
the virtual `Weight::for_each_pruning(..., PruningCallback)`. When
`weight->for_each_pruning(ctx, key, threshold, callback)` is called through a
`Weight*` pointer (as in `top_k_collector.cpp:40`), virtual dispatch invokes
the base class default which falls back to linear `for_each_pruning_scorer` —
bypassing WAND.
For **multi-term** queries this is fine because `OccurBooleanWeight`
overrides `for_each_pruning` and uses `complex_scorer()` + `block_wand()`
directly. But for a **standalone single-term** query dispatched through the
`Weight` interface, WAND block-level pruning is silently skipped.
If single-term WAND is desired, `TermWeight` should override the virtual
`for_each_pruning` methods directly (not as a template).
##########
be/src/storage/index/inverted/query_v2/boolean_query/occur_boolean_weight.h:
##########
@@ -101,4 +109,35 @@ class OccurBooleanWeight : public Weight {
uint32_t _max_doc = 0;
};
+template <typename ScoreCombinerPtrT>
+void OccurBooleanWeight<ScoreCombinerPtrT>::for_each_pruning(const
QueryExecutionContext& context,
+ float threshold,
+ PruningCallback
callback) {
+ for_each_pruning(context, {}, threshold, std::move(callback));
+}
+
+template <typename ScoreCombinerPtrT>
+void OccurBooleanWeight<ScoreCombinerPtrT>::for_each_pruning(const
QueryExecutionContext& context,
+ const
std::string& binding_key,
+ float threshold,
+ PruningCallback
callback) {
+ if (_sub_weights.empty()) {
+ return;
+ }
+
+ _max_doc = context.segment_num_rows;
+ auto specialized = complex_scorer(context, _score_combiner, binding_key);
+
+ std::visit(
+ [&](auto&& arg) {
+ using T = std::decay_t<decltype(arg)>;
+ if constexpr (std::is_same_v<T, std::vector<TermScorerPtr>>) {
+ block_wand(std::move(arg), threshold, std::move(callback));
+ } else {
+ for_each_pruning_scorer(std::move(arg), threshold,
std::move(callback));
+ }
+ },
+ std::move(specialized));
+}
+
} // namespace doris::segment_v2::inverted_index::query_v2
Review Comment:
**[Nit]** Missing newline at end of file.
##########
be/src/storage/index/inverted/query_v2/collect/top_k_collector.cpp:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "storage/index/inverted/query_v2/collect/top_k_collector.h"
+
+#include "storage/index/inverted/query_v2/collect/multi_segment_util.h"
+
+namespace doris::segment_v2::inverted_index::query_v2 {
+
+void collect_multi_segment_top_k(const WeightPtr& weight, const
QueryExecutionContext& context,
+ const std::string& binding_key, size_t k,
+ const std::shared_ptr<roaring::Roaring>&
roaring,
+ const CollectionSimilarityPtr& similarity,
bool use_wand) {
+ TopKCollector final_collector(k);
+
+ for_each_index_segment(
+ context, binding_key, [&](const QueryExecutionContext& seg_ctx,
uint32_t seg_base) {
+ float initial_threshold = final_collector.threshold();
+
+ TopKCollector seg_collector(k);
+ auto callback = [&seg_collector](uint32_t doc_id, float score)
-> float {
+ return seg_collector.collect(doc_id, score);
Review Comment:
**[Performance Bug] Cross-segment threshold regression**
The `seg_collector` starts fresh each segment with `_threshold = -infinity`.
When the WAND callback calls `seg_collector.collect(doc_id, score)`, it returns
`seg_collector._threshold`, which is `-infinity` until the per-segment
collector has accumulated `k` items.
This means the cross-segment `initial_threshold` passed to WAND (line 40) is
immediately overwritten on the first match:
1. WAND starts with `threshold = initial_threshold` (e.g., 5.0 from previous
segments)
2. First doc scores 6.0 > 5.0, so callback is invoked
3. Callback returns `seg_collector._threshold` = `-infinity` (only 1 item,
need k for threshold to rise)
4. WAND's threshold drops to `-infinity`, losing all cross-segment pruning
benefit
The same issue affects the non-WAND path via `for_each_pruning_scorer`.
Fix: the callback should preserve the cross-segment floor:
```cpp
auto callback = [&seg_collector, initial_threshold](uint32_t doc_id, float
score) -> float {
return std::max(initial_threshold, seg_collector.collect(doc_id, score));
};
```
Alternatively, add a `set_initial_threshold(float)` method to
`TopKCollector` to pre-seed the threshold.
##########
be/src/storage/index/inverted/query_v2/segment_postings.h:
##########
@@ -155,14 +165,63 @@ class SegmentPostings final : public Postings {
bool scoring_enabled() const { return _enable_scoring; }
+ int64_t block_id() const { return _block_id; }
+
+ void seek_block(uint32_t target_doc) {
+ if (target_doc <= _doc) {
+ return;
+ }
+ if (_raw_iter->skipToBlock(target_doc)) {
+ _block_max_score_cache = -1.0F;
+ _cursor = 0;
+ _block.doc_many_size_ = 0;
+ }
+ }
+
+ uint32_t last_doc_in_block() const {
+ int32_t last_doc = _raw_iter->getLastDocInBlock();
+ if (last_doc == -1 || last_doc == 0x7FFFFFFFL) {
+ return TERMINATED;
+ }
+ return static_cast<uint32_t>(last_doc);
+ }
+
+ float block_max_score() {
+ if (!_enable_scoring || !_similarity) {
+ return std::numeric_limits<float>::max();
+ }
+ if (_block_max_score_cache >= 0.0F) {
+ return _block_max_score_cache;
+ }
+ int32_t max_block_freq = _raw_iter->getMaxBlockFreq();
+ int32_t max_block_norm = _raw_iter->getMaxBlockNorm();
Review Comment:
**[Minor] `block_max_score()` and `max_score()` return `float::max()` when
scoring disabled**
Returning `numeric_limits<float>::max()` when scoring is disabled means:
- Single-scorer WAND: `block_max_score() < threshold` is never true → no
block skipping (correct but defeats purpose)
- Multi-scorer WAND: one scorer returning `float::max()` causes
`block_max_score_upperbound` sum to overflow → pivot selection breaks, no
pruning
Since WAND should only be invoked when scoring is enabled, this is safe in
practice. But consider adding a `DCHECK(_enable_scoring && _similarity)`
assertion to catch misuse early, rather than silently degrading to linear scan.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]