[ https://issues.apache.org/jira/browse/HIVE-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17089580#comment-17089580 ]

Attila Magyar commented on HIVE-23265:
--------------------------------------

Hey [~chiran54321], thanks for reporting this. I think the problem is that VectorLimitOperator only updates the selected array and leaves the batch size as it is. I uploaded a potential fix; let's see if it breaks any other tests.

> Duplicate rowsets are returned with Limit and Offset set
> --------------------------------------------------------
>
>                 Key: HIVE-23265
>                 URL: https://issues.apache.org/jira/browse/HIVE-23265
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, Vectorization
>    Affects Versions: 3.1.0, 3.1.2
>            Reporter: Chiran Ravani
>            Assignee: Attila Magyar
>            Priority: Critical
>         Attachments: 000000_0, HIVE-23265.1.patch
>
>
> We have a query that produces duplicate results even though there are no duplicate records in the underlying tables.
> Sample Query
> {code:java}
> select * from orderdatatest_ext order by col1 limit 1000,50
> {code}
> The problem appears when the order by clause is used on col1, which has non-unique values. The duplicates appear to be produced during the reducer phase of the query.
> With set hive.vectorized.execution.reduce.enabled=false, the problem does not occur.
> The data in the table is as follows.
> {code:java}
> 1,1
> 1,2
> 1,3
> .
> .
> 1,1500
> {code}
> Results with hive.vectorized.execution.reduce.enabled=true
> {code:java}
> +-------------------------+-------------------------+
> | orderdatatest_ext.col1  | orderdatatest_ext.col2  |
> +-------------------------+-------------------------+
> | 1                       | 1001                    |
> | 1                       | 1002                    |
> | 1                       | 1003                    |
> | 1                       | 1004                    |
> | 1                       | 1005                    |
> | 1                       | 1006                    |
> | 1                       | 1007                    |
> | 1                       | 1008                    |
> | 1                       | 1009                    |
> | 1                       | 1010                    |
> | 1                       | 1011                    |
> | 1                       | 1012                    |
> | 1                       | 1013                    |
> | 1                       | 1014                    |
> | 1                       | 1015                    |
> | 1                       | 1016                    |
> | 1                       | 1017                    |
> | 1                       | 1018                    |
> | 1                       | 1019                    |
> | 1                       | 1020                    |
> | 1                       | 1021                    |
> | 1                       | 1022                    |
> | 1                       | 1023                    |
> | 1                       | 1024                    |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> | 1                       | 1                       |
> +-------------------------+-------------------------+
> {code}
> Results with hive.vectorized.execution.reduce.enabled=false
> {code:java}
> +-------------------------+-------------------------+
> | orderdatatest_ext.col1  | orderdatatest_ext.col2  |
> +-------------------------+-------------------------+
> | 1                       | 1001                    |
> | 1                       | 1002                    |
> | 1                       | 1003                    |
> | 1                       | 1004                    |
> | 1                       | 1005                    |
> | 1                       | 1006                    |
> | 1                       | 1007                    |
> | 1                       | 1008                    |
> | 1                       | 1009                    |
> | 1                       | 1010                    |
> | 1                       | 1011                    |
> | 1                       | 1012                    |
> | 1                       | 1013                    |
> | 1                       | 1014                    |
> | 1                       | 1015                    |
> | 1                       | 1016                    |
> | 1                       | 1017                    |
> | 1                       | 1018                    |
> | 1                       | 1019                    |
> | 1                       | 1020                    |
> | 1                       | 1021                    |
> | 1                       | 1022                    |
> | 1                       | 1023                    |
> | 1                       | 1024                    |
> | 1                       | 1025                    |
> | 1                       | 1026                    |
> | 1                       | 1027                    |
> | 1                       | 1028                    |
> | 1                       | 1029                    |
> | 1                       | 1030                    |
> | 1                       | 1031                    |
> | 1                       | 1032                    |
> | 1                       | 1033                    |
> | 1                       | 1034                    |
> | 1                       | 1035                    |
> | 1                       | 1036                    |
> | 1                       | 1037                    |
> | 1                       | 1038                    |
> | 1                       | 1039                    |
> | 1                       | 1040                    |
> | 1                       | 1041                    |
> | 1                       | 1042                    |
> | 1                       | 1043                    |
> | 1                       | 1044                    |
> | 1                       | 1045                    |
> | 1                       | 1046                    |
> | 1                       | 1047                    |
> | 1                       | 1048                    |
> | 1                       | 1049                    |
> | 1                       | 1050                    |
> +-------------------------+-------------------------+
> {code}
> Table DDL
> {code:java}
> CREATE EXTERNAL TABLE orderdatatest_ext (col1 int, col2 int) stored as orc
> {code}
> A sample ORC file is attached.
> The problem appears to be in VectorLimitOperator.
> {code}
> 2020-04-20 15:35:49,693 [INFO] [TezChild] |vector.VectorSelectOperator|: Initializing operator SEL[6]
> 2020-04-20 15:35:49,747 [INFO] [TezChild] |vector.VectorSelectOperator|: RECORDS_OUT_INTERMEDIATE_Map_1:0, RECORDS_OUT_OPERATOR_SEL_6:1500,
> 2020-04-20 15:35:50,142 [INFO] [TezChild] |vector.VectorSelectOperator|: Initializing operator SEL[8]
> 2020-04-20 15:35:50,303 [INFO] [TezChild] |vector.VectorSelectOperator|: RECORDS_OUT_OPERATOR_SEL_8:1050, RECORDS_OUT_INTERMEDIATE_Reducer_2:0,
> 2020-04-20 15:35:50,142 [INFO] [TezChild] |vector.VectorLimitOperator|: Initializing operator LIM[9]
> 2020-04-20 15:35:50,303 [INFO] [TezChild] |vector.VectorLimitOperator|: RECORDS_OUT_INTERMEDIATE_Reducer_2:0, RECORDS_OUT_OPERATOR_LIM_9:1050,
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
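
To make the diagnosis in the comment concrete, here is a minimal, self-contained sketch of how a stale batch size can produce the pattern in the vectorized results quoted above. It is a toy model, not the Hive source and not the attached patch: the `Batch` class, `applyOffset`, `read`, and `firstBatch` are hypothetical names, and the 1024-row batch with zero-initialized `selected` entries is an assumption chosen to match the reported data.

```java
import java.util.ArrayList;
import java.util.List;

class LimitOffsetSketch {
    /** Toy stand-in for a vectorized row batch (hypothetical, not Hive's class). */
    static final class Batch {
        final int[] col2;       // column values, one per physical row
        final int[] selected;   // indices of the logically live rows
        boolean selectedInUse;  // when true, readers go through `selected`
        int size;               // number of logically live rows

        Batch(int[] col2) {
            this.col2 = col2;
            this.selected = new int[col2.length]; // all zeros initially
            this.size = col2.length;
        }
    }

    /**
     * Skips the first `skip` rows by rewriting `selected`.
     * With fixSize == false the size is left untouched, which models the
     * behaviour described in the comment; with fixSize == true it is shrunk.
     */
    static void applyOffset(Batch b, int skip, boolean fixSize) {
        int remaining = b.size - skip;
        for (int i = 0; i < remaining; i++) {
            b.selected[i] = skip + i;
        }
        b.selectedInUse = true;
        if (fixSize) {
            b.size = remaining;
        }
    }

    /** Reads at most `max` live rows the way a downstream consumer would. */
    static List<Integer> read(Batch b, int max) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < b.size && out.size() < max; i++) {
            int row = b.selectedInUse ? b.selected[i] : i;
            out.add(b.col2[row]);
        }
        return out;
    }

    /** First batch of the sorted input: col2 values 1..1024 (assumed batch size). */
    static Batch firstBatch() {
        int[] col2 = new int[1024];
        for (int i = 0; i < col2.length; i++) {
            col2[i] = i + 1;
        }
        return new Batch(col2);
    }

    public static void main(String[] args) {
        Batch stale = firstBatch();
        applyOffset(stale, 1000, false);   // size stays 1024
        System.out.println(read(stale, 50));

        Batch shrunk = firstBatch();
        applyOffset(shrunk, 1000, true);   // size becomes 24
        System.out.println(read(shrunk, 50));
    }
}
```

With the stale size, the first list holds 1001 through 1024 followed by 26 copies of 1, because the untouched `selected` entries still point at row 0. With the size shrunk, the batch yields only 1001 through 1024, and the remaining rows would have to come from the next batch.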