[jira] [Commented] (KYLIN-6055) Empty `dimension_range_info_map` After Building Model with Secondary Partition Causes Query to Fail Filtering by Dimension

Guoliang Sun (Jira) Tue, 25 Feb 2025 01:51:21 -0800


    [ 
https://issues.apache.org/jira/browse/KYLIN-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930189#comment-17930189
 ]


Guoliang Sun commented on KYLIN-6055:
-------------------------------------

h1. Dev Design


Scope Clarification: With multi-level partitioning enabled, 
`dimension_range_info_map` is supported only at the first-level time partition 
column, where its min/max data is used for query filtering. True multi-level 
partition column support is not included.  

Limitation: Cases involving deleted partitions are not handled.  
h2. Original Design


1. First-Level Time Partition Column Calculation Logic:  
   - Build: Calculate min/max directly from flat table data (see 
`SegmentExec#calDimRange`).  
   - Merge: Merge logic in `MergeStage#mergeDimRange`.  
     - If flat table data exists, calculate min/max directly.  
     - If no flat table data exists, check `dimension_range_info_map` in 
Segments:  
       - If any Segment to be merged lacks `dimension_range_info_map`, skip 
calculation (without min/max info queries take longer, but no data is lost).  
       - If all Segments to be merged have `dimension_range_info_map`, extract 
min/max for each column and apply data type-specific logic.  
   - Metadata Update: Replace old Segments with new ones during updates; no 
extra `dimension_range_info_map` update required (see 
`AfterMergeOrRefreshResourceMerger#merge`).  
     - For Segment Partition updates, aggregate related data (e.g., 
`column_source_bytes`) and apply the updates (see 
`MetadataMerger#upsertSegmentPartition`). Multi-level partition column 
scenarios were unsupported, so no `dimension_range_info_map` updates were 
performed.  
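
The Segment-level merge rule above (skip when any Segment lacks range info, otherwise widen min/max per column) can be sketched roughly as follows. This is a minimal illustration, not the code in `MergeStage#mergeDimRange`: `DimRange` and `SegmentRangeMergeSketch` are hypothetical names, values are simplified to `long` instead of the data type-specific handling, and dropping columns that are not present in every Segment is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for one column's min/max entry; not Kylin's actual class. */
final class DimRange {
    final long min;
    final long max;

    DimRange(long min, long max) {
        this.min = min;
        this.max = max;
    }
}

class SegmentRangeMergeSketch {
    /**
     * Merges per-Segment range maps (column id -> range).
     * Returns empty if any Segment has no range info: pruning is lost for the
     * merged Segment, but query results stay correct.
     */
    static Optional<Map<String, DimRange>> mergeSegments(List<Map<String, DimRange>> segmentRanges) {
        if (segmentRanges.isEmpty()) {
            return Optional.empty();
        }
        for (Map<String, DimRange> ranges : segmentRanges) {
            if (ranges == null || ranges.isEmpty()) {
                return Optional.empty();
            }
        }
        Map<String, DimRange> merged = new HashMap<>(segmentRanges.get(0));
        for (Map<String, DimRange> ranges : segmentRanges.subList(1, segmentRanges.size())) {
            // Assumption: a column missing from any Segment is dropped, because a
            // range that ignores one Segment could wrongly prune its rows.
            merged.keySet().retainAll(ranges.keySet());
            for (Map.Entry<String, DimRange> e : merged.entrySet()) {
                DimRange cur = e.getValue();
                DimRange other = ranges.get(e.getKey());
                e.setValue(new DimRange(Math.min(cur.min, other.min), Math.max(cur.max, other.max)));
            }
        }
        return Optional.of(merged);
    }

    public static void main(String[] args) {
        Map<String, DimRange> a = Map.of("0", new DimRange(5, 10));
        Map<String, DimRange> b = Map.of("0", new DimRange(3, 7));
        DimRange r = mergeSegments(List.of(a, b)).get().get("0");
        System.out.println(r.min + " .. " + r.max);
    }
}
```

Returning an empty `Optional` rather than a partial map keeps the "no data loss" property explicit: a caller either gets a range that covers every input Segment or gets nothing and falls back to a full scan.
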
h2. New Design
1. New Parameter: `kylin.build.multi-partition-filter-enabled` (default: 
`false`, feature disabled).  
2. Build:  
   - Retain original logic but remove the 
`Objects.isNull(segment.getModel.getMultiPartitionDesc)` check to enable 
calculations even with multi-level partitions.  
   - Subclasses reuse this logic, then perform a first merge of the computed 
result with the `dimension_range_info_map` already stored in the Segment. The 
merge keeps the smallest min and the largest max per column, comparing values 
according to the column's data type. Columns must be defined in the model; 
otherwise, the comparison is skipped to avoid incorrect filtering. A missing 
min/max may reduce filtering effectiveness but will not cause data loss.  
3. Merge:  
   - Partitions are never merged on their own; merging exists only at the 
Segment level. Reuse the Segment merge logic to calculate min/max for the 
`dimension_range_info_map` columns in the merged Segment and write them to the 
JSON file. No additional logic is needed in `PartitionMergeStage`.  
4. Metadata Update:  
   - `dimension_range_info_map` must now be set and updated explicitly. 
Because multiple subtasks (e.g., multiple partitions) may modify the 
`dimension_range_info_map` in the same Segment JSON file, retrieve the 
previously stored `dimension_range_info_map` and perform a second merge. This 
merge operation locks the Segment JSON file path, preventing other tasks from 
reading/writing until the update is complete.  
   - Once `dimension_range_info_map` is written to the Segment's JSON file, 
queries can filter based on column min/max info as before; no special changes 
are needed.
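
A rough sketch of the second merge in step 4, combined with the type-aware comparison from step 2, might look like the following. It rests on several assumptions: `PartitionRangeUpdateSketch`, `comparatorFor`, and `mergePartitionResult` are hypothetical names, an in-process `ReentrantLock` stands in for the lock on the Segment JSON file path, min/max values are held as strings, and the `filterEnabled` flag mirrors `kylin.build.multi-partition-filter-enabled`. It is not the actual `MetadataMerger` code.

```java
import java.math.BigDecimal;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: none of these names exist in Kylin. */
class PartitionRangeUpdateSketch {
    // Stands in for the lock on the Segment JSON file path.
    private final ReentrantLock segmentLock = new ReentrantLock();
    // Stored ranges: column id -> {min, max}.
    private final Map<String, String[]> stored = new HashMap<>();
    // Columns defined in the model: column id -> data type name.
    private final Map<String, String> modelColumnTypes;
    private final boolean filterEnabled;

    PartitionRangeUpdateSketch(Map<String, String> modelColumnTypes, boolean filterEnabled) {
        this.modelColumnTypes = modelColumnTypes;
        this.filterEnabled = filterEnabled;
    }

    /** Numeric types compare by value; other types (e.g. ISO dates) compare as strings. */
    static Comparator<String> comparatorFor(String dataType) {
        switch (dataType) {
            case "int":
            case "bigint":
            case "decimal":
                return (a, b) -> new BigDecimal(a).compareTo(new BigDecimal(b));
            default:
                return Comparator.naturalOrder();
        }
    }

    /** Merges one partition subtask's ranges into the stored ranges under the lock. */
    void mergePartitionResult(Map<String, String[]> partitionRanges) {
        if (!filterEnabled) {
            return; // feature disabled: leave the stored map untouched
        }
        segmentLock.lock();
        try {
            for (Map.Entry<String, String[]> e : partitionRanges.entrySet()) {
                String type = modelColumnTypes.get(e.getKey());
                if (type == null) {
                    continue; // column not defined in the model: skip the comparison
                }
                Comparator<String> cmp = comparatorFor(type);
                stored.merge(e.getKey(), e.getValue().clone(), (old, neu) -> new String[] {
                    cmp.compare(old[0], neu[0]) <= 0 ? old[0] : neu[0],
                    cmp.compare(old[1], neu[1]) >= 0 ? old[1] : neu[1]
                });
            }
        } finally {
            segmentLock.unlock();
        }
    }

    /** Returns {min, max} for a column, or null if no range is stored. */
    String[] rangeOf(String column) {
        segmentLock.lock();
        try {
            return stored.get(column);
        } finally {
            segmentLock.unlock();
        }
    }

    public static void main(String[] args) {
        PartitionRangeUpdateSketch s = new PartitionRangeUpdateSketch(Map.of("1", "int"), true);
        s.mergePartitionResult(Map.of("1", new String[] {"9", "100"}));
        s.mergePartitionResult(Map.of("1", new String[] {"10", "20"}));
        System.out.println(String.join(" .. ", s.rangeOf("1")));
    }
}
```

The sketch assumes every partition subtask reports ranges for the same set of columns; if one subtask omitted a column, the stored range for that column would no longer cover all partitions and should be discarded rather than used for filtering.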

> Empty `dimension_range_info_map` After Building Model with Secondary 
> Partition Causes Query to Fail Filtering by Dimension
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-6055
>                 URL: https://issues.apache.org/jira/browse/KYLIN-6055
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: 5.0.0
>            Reporter: Guoliang Sun
>            Assignee: Guoliang Sun
>            Priority: Major
>             Fix For: 5.0.2
>
>         Attachments: image-2025-02-25-16-56-39-917.png, 
> image-2025-02-25-16-57-04-319.png
>
>
> For the same model without secondary partitioning, the dimension min/max in 
> `dimension_range_info_map` is generated correctly after building.
> !image-2025-02-25-16-56-39-917.png|width=427,height=263!
> For the same model with secondary partitioning, the dimension min/max in 
> `dimension_range_info_map` is not generated correctly after building.
> !image-2025-02-25-16-57-04-319.png|width=446,height=190!
> Impact: For queries that filter only on regular dimensions and not on 
> partition columns, the query scans all of the data after model building, so 
> query performance cannot be guaranteed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
