github-actions[bot] commented on code in PR #64205:
URL: https://github.com/apache/doris/pull/64205#discussion_r3425284777
##########
be/src/storage/segment/variant/variant_column_reader.cpp:
##########
@@ -882,8 +882,12 @@ Status VariantColumnReader::_build_read_plan(ReadPlan*
plan, const TabletColumn&
}
// Check if path is prefix, example sparse columns path: a.b.c, a.b.e,
access prefix: a.b.
- // Or access root path
- if (_has_prefix_path_unlocked(relative_path)) {
+ // Or access root path. If sparse stats reached the configured limit, an
exact sparse path can
+ // still have unrecorded sparse children such as a.b.c.
+ const bool has_prefix_path = _has_prefix_path_unlocked(relative_path);
+ const bool sparse_stats_may_have_unrecorded_children =
+ exceeded_sparse_column_limit && existed_in_sparse_column;
+ if (has_prefix_path || sparse_stats_may_have_unrecorded_children) {
Review Comment:
This only treats truncated stats as ambiguous when the exact target path is
recorded in sparse stats, but the same ambiguity exists when the target path is
a materialized leaf and an unrecorded sparse child exists. A concrete shape is
`variant_max_subcolumns_count = 1`, sparse stats limit = 1, with `b` having the
highest non-null count so it is kept as an extracted leaf, while overflow
sparse paths include `a` and `b.c`; `serialize_sparse_columns()` iterates the
remaining `std::map`, so the single recorded sparse stat can be `a` and `b.c`
is physically present but absent from stats. For `v['b']`, `node` is the leaf
for `b`, `_has_prefix_path_unlocked("b")` is false because there is no child
node and no recorded `b.` sparse prefix, and `existed_in_sparse_column` is
false, so this branch is skipped and `_try_build_leaf_plan()` returns only the
`b` leaf. Rows whose `b` value only comes from sparse `b.c` are not merged.
When sparse stats are truncated, a materialized exact node also n
eeds a hierarchical read unless there is a persisted proof that no sparse
child with this prefix exists.
##########
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java:
##########
@@ -3597,6 +3597,7 @@ public void
checkAnnIndexCandidateRowsPercentThreshold(String value) {
@VarAttrDef.VarAttr(
name = DEFAULT_VARIANT_MAX_SPARSE_COLUMN_STATISTICS_SIZE,
needForward = true,
+ checker = "checkDefaultVariantMaxSparseColumnStatisticsSize",
fuzzy = true
Review Comment:
This checker rejects new `SET` statements, but it does not cover upgraded
metadata that was valid before this PR. `SessionVariable.readFromJson()` loads
persisted int fields directly, and `VariableMgr.replayGlobalVariableV2()`
catches and ignores checker failures, so an image/edit log containing
`default_variant_max_sparse_column_statistics_size = 0` can still leave new
sessions with value 0 after upgrade. Existing table metadata can also already
contain 0, as shown by this PR changing the nested schema-change expected
output from 0 to 1. Those zero-valued tables/sessions still reach BE, where a
writer with limit 0 records no sparse stats and the reader no longer treats
empty stats plus a sparse reader as truncated. Please add a compatibility
normalization/migration for persisted zero values, or keep a BE compatibility
branch for `limit == 0` while still rejecting new user input.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]