[ https://issues.apache.org/jira/browse/KYLIN-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shaofeng SHI updated KYLIN-3122:
--------------------------------
    Fix Version/s: v2.4.0

> Partition elimination algorithm seems to be inefficient and have serious
> issues with handling date/time ranges, can lead to very slow queries and
> OOM/Java heap dump conditions
> -------------------------------------------------------------------------
>
>                 Key: KYLIN-3122
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3122
>             Project: Kylin
>          Issue Type: Bug
>          Components: Query Engine
>    Affects Versions: v2.2.0
>         Environment: HDP 2.5.6, Kylin 2.2.0
>            Reporter: Vsevolod Ostapenko
>            Assignee: hongbin ma
>            Priority: Critical
>             Fix For: v2.4.0
>
>         Attachments: partition_elimination_bug_single_column_test.log
>
>
> The current algorithm for cube segment elimination seems to be rather
> inefficient.
> We are using a model where cubes are partitioned by date and time:
> "partition_desc": {
>   "partition_date_column": "A_VL_HOURLY_V.THEDATE",
>   "partition_time_column": "A_VL_HOURLY_V.THEHOUR",
>   "partition_date_start": 0,
>   "partition_date_format": "yyyyMMdd",
>   "partition_time_format": "HH",
>   "partition_type": "APPEND",
>   "partition_condition_builder": "org.apache.kylin.metadata.model.PartitionDesc$DefaultPartitionConditionBuilder"
> },
> Cubes contain partitions for multiple days and 24 hours for each day. Each
> cube segment corresponds to just one hour.
> When a query is issued where both date and hour are specified using an
> equality condition (e.g. thedate = '20171011' and thehour = '10'), Kylin
> sequentially iterates over all the cube segments (hundreds of them) only to
> skip all except the one that needs to be scanned (which can be observed by
> looking in the logs).
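For illustration only, here is a minimal sketch of a non-linear segment lookup of the kind the report asks for. It is not Kylin's code: `SegmentLookup`, `Segment`, and the representation of segment bounds as hour indexes are all hypothetical, and it assumes the segments are sorted by start time and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentLookup {
    /** Hypothetical stand-in for a cube segment covering hours [startHour, endHour). */
    static final class Segment {
        final long startHour;
        final long endHour;

        Segment(long startHour, long endHour) {
            this.startHour = startHour;
            this.endHour = endHour;
        }
    }

    /** Binary search: index of the first segment whose end lies after t. */
    static int firstEndingAfter(List<Segment> segs, long t) {
        int lo = 0;
        int hi = segs.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (segs.get(mid).endHour <= t) {
                lo = mid + 1;   // segment ends at or before t, so it cannot match
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Segments overlapping the half-open query range [from, to). */
    static List<Segment> overlapping(List<Segment> segs, long from, long to) {
        List<Segment> hits = new ArrayList<>();
        for (int i = firstEndingAfter(segs, from);
             i < segs.size() && segs.get(i).startHour < to; i++) {
            hits.add(segs.get(i));
        }
        return hits;
    }

    public static void main(String[] args) {
        // 30 days of one-hour segments, as in the model described in the report.
        List<Segment> segs = new ArrayList<>();
        for (long h = 0; h < 24 * 30; h++) {
            segs.add(new Segment(h, h + 1));
        }
        List<Segment> hits = overlapping(segs, 250, 251);
        System.out.println(hits.size() + " segment(s), starting at hour " + hits.get(0).startHour);
    }
}
```

With sorted segments the equality case costs O(log n) comparisons plus one step per matching segment, instead of visiting every segment.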
> The expectation is that Kylin would use existing info on the partitioning
> columns (date and time) and the known hierarchical relation between date and
> time to locate the required partition much more efficiently than a linear
> scan through all the cube partitions.
> Now, if the filtering condition is on a range of hours, the behavior of
> partition pruning and scanning becomes illogical, which suggests bugs in the
> logic.
> If the filtering condition is on a specific date and a closed-open range of
> hours (e.g. thedate = '20171011' and thehour >= '10' and thehour < '11'), in
> addition to sequentially scanning all the cube partitions (as described
> above), Kylin will scan HBase tables for all the hours from the specified
> starting hour until the last hour of the day (e.g. from hour 10 to 24,
> instead of just hour 10).
> As a result, the query will run much longer than necessary and might run
> out of memory, causing a JVM heap dump and a Kylin server crash.
> If the filtering condition is on a specific date but the hour interval is
> specified as open-closed (e.g. thedate = '20171011' and thehour > '09' and
> thehour <= '10'), Kylin will scan the HBase tables for all the later dates
> and hours (e.g. from hour 10 until the most recent hour on the most recent
> day, which can be hundreds of tables and thousands of regions).
> As a result, query execution time will increase dramatically, and in most
> cases the Kylin server will be terminated with an OOM error and a JVM heap
> dump.
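As a hedged illustration of the expected behavior (again not Kylin's implementation), both example predicates above should reduce to the same one-hour scan range. The class `HourRange` and its method are hypothetical. The sketch assumes whole-hour values 0 to 23 and parses the date with `DateTimeFormatter.BASIC_ISO_DATE`, which matches the "yyyyMMdd" layout.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class HourRange {
    /**
     * Converts a date plus lower/upper hour bounds into a half-open range
     * {start, endExclusive}. Hours are discrete, so an exclusive lower bound
     * moves up by one hour and an inclusive upper bound ends one hour later.
     */
    static LocalDateTime[] toHalfOpen(String date, int lo, boolean loInclusive,
                                      int hi, boolean hiInclusive) {
        LocalDate day = LocalDate.parse(date, DateTimeFormatter.BASIC_ISO_DATE);
        int firstHour = loInclusive ? lo : lo + 1;
        int endHourExclusive = hiInclusive ? hi + 1 : hi;
        LocalDateTime midnight = day.atStartOfDay();
        return new LocalDateTime[] {
            midnight.plusHours(firstHour),
            midnight.plusHours(endHourExclusive)   // hour 24 rolls over to the next day
        };
    }

    public static void main(String[] args) {
        // thehour >= '10' and thehour < '11'  (closed-open)
        LocalDateTime[] a = toHalfOpen("20171011", 10, true, 11, false);
        // thehour > '09' and thehour <= '10'  (open-closed)
        LocalDateTime[] b = toHalfOpen("20171011", 9, false, 10, true);
        System.out.println("[" + a[0] + ", " + a[1] + ")");
        System.out.println("[" + b[0] + ", " + b[1] + ")");
    }
}
```

Both calls yield the range from 10:00 to 11:00 on 2017-10-11, so only the single hourly segment for hour 10 would need to be scanned.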