[ https://issues.apache.org/jira/browse/HIVE-23712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138436#comment-17138436 ]
László Bodor commented on HIVE-23712: ------------------------------------- the root cause is that in [MetadataOnlyOptimizer|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/MetadataOnlyOptimizer.java#L124] the TS operator of test1 table is considered to be subject of metadata-only optimization and later [NullScanTaskDispatcher|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java#L106] find a non-empty folder for this partition because of ACID operations, with files: {code} hdfs://localhost:58447/build/ql/test/data/warehouse/test1/val2=bar/delete_delta_0000003_0000003_0000 hdfs://localhost:58447/build/ql/test/data/warehouse/test1/val2=bar/delta_0000002_0000002_0000 {code} not sure about the perfect solution at the moment, but maybe the following scenario should be excluded somehow from metadata-only optimization: 1. there is a partitioned table: create table test1 (id int, val string) partitioned by (val2 string) STORED AS ORC TBLPROPERTIES ('transactional'='true'); 2. in a distinct query, only the partitioned column is selected: select distinct val2, current_timestamp, 'metadata true' as query from test1; {code} in this case tsOp.getNeededColumnIDs() is empty (partition column is not present in needed columns) {code} > metadata-only queries return incorrect results with empty partition > ------------------------------------------------------------------- > > Key: HIVE-23712 > URL: https://issues.apache.org/jira/browse/HIVE-23712 > Project: Hive > Issue Type: Bug > Reporter: László Bodor > Assignee: László Bodor > Priority: Major > > Similarly to HIVE-15397, queries can return incorrect results for > metadata-only queries, here is a repro scenario which affects master: > {code} > set hive.support.concurrency=true; > set hive.exec.dynamic.partition.mode=nonstrict; > set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; > set hive.optimize.metadataonly=true; > create table test1 (id int, val string) partitioned by (val2 string) STORED > AS ORC TBLPROPERTIES ('transactional'='true'); > describe formatted test1; > alter table test1 add partition (val2='foo'); > alter table test1 add partition (val2='bar'); > insert into test1 partition (val2='foo') values (1, 'abc'); > select distinct val2, current_timestamp from test1; > insert into test1 partition (val2='bar') values (1, 'def'); > delete from test1 where val2 = 'bar'; > select '--> hive.optimize.metadataonly=true'; > select distinct val2, current_timestamp from test1; > set hive.optimize.metadataonly=false; > select '--> hive.optimize.metadataonly=false'; > select distinct val2, current_timestamp from test1; > select current_timestamp, * from test1; > {code} > in this case 2 rows returned instead of 1 after a delete with metadata only > optimization: > https://github.com/abstractdog/hive/commit/a7f03513564d01f7c3ba4aa61c4c6537100b4d3f#diff-cb23043000831f41fe7041cb38f82224R114-R128 -- This message was sent by Atlassian Jira (v8.3.4#803005)