[GitHub] [hive] kgyrtkirk commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

GitBox Tue, 02 Mar 2021 03:50:21 -0800


kgyrtkirk commented on a change in pull request #2027:
URL: https://github.com/apache/hive/pull/2027#discussion_r585391616




##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       I was going thru here and there and I think there might be another way 
around this problem which could retain this optimization as well:
   * introduce a new `NOT` operator: which can be controlled to return 
true/false in case of null values
   * in case of filter expressions start using the new not operator; and switch 
mode below every `NOT` operator
   
   but this feels like a more complicated change - we should only do it if we 
loose important optimizations

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {
               TypeInfo targetType = exprFactory.getTypeInfo(valueDesc);
               if (!expressions.containsKey(targetType)) {

Review comment:
       this if statement has no effect - the map value will be overwritten 
anyway ; I wonder if we have a bug here

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));

Review comment:
       for `IN` the original logic is valid as long as it's in 
`UnknownAs.FALSE` mode...but for `NOT IN` the correct interpretation would be 
`UnknownAs.TRUE`.
   
   I think we might be better off not coping with the `UnknownAs` devils here - 
and retain the original expressions as in the current proposed patch; I'm not 
sure how much optimization opportunities/performance we will loose that way.
   
   

##########
File path: 
ql/src/java/org/apache/hadoop/hive/ql/parse/type/TypeCheckProcFactory.java
##########
@@ -1007,17 +1001,12 @@ protected T getXpathOrFuncExprNodeDesc(ASTNode node,
             T columnDesc = children.get(0);
             T valueDesc = interpretNode(columnDesc, children.get(i));
             if (valueDesc == null) {
-              if (hasNullValue) {
-                // Skip if null value has already been added
-                continue;
-              }
-              TypeInfo targetType = exprFactory.getTypeInfo(columnDesc);
+              // Keep original
+              TypeInfo targetType = exprFactory.getTypeInfo(children.get(i));
               if (!expressions.containsKey(targetType)) {
                 expressions.put(targetType, columnDesc);
               }
-              T nullConst = exprFactory.createConstantExpr(targetType, null);
-              expressions.put(targetType, nullConst);
-              hasNullValue = true;
+              expressions.put(targetType, children.get(i));
             } else {

Review comment:
       other idea could be to retain and do these optimizations but only if we 
are not below a NOT operator

##########
File path: ql/src/test/queries/clientpositive/in_coercion.q
##########
@@ -0,0 +1,14 @@
+DROP TABLE src_table;
+CREATE TABLE src_table (key int);
+LOAD DATA LOCAL INPATH '../../data/files/kv6.txt' INTO TABLE src_table;
+
+-- verify table has data
+select count(*) from src_table; 

Review comment:
       note: we may use the `assert_true` udf here (so that no one will be able 
to just silently overwrite the q.out)
   ```
   select assert_true(count(*) = 100) from src_table
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [hive] kgyrtkirk commented on a change in pull request #2027: HIVE-24817: A "not in" clause returns incorrect data when there is co…

Reply via email to