[ 
https://issues.apache.org/jira/browse/HIVE-28880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010748#comment-18010748
 ] 

Soumyakanti Das commented on HIVE-28880:
----------------------------------------

To understand why this occurs, we need to run {{{}explain vectorization 
detail{}}}. We can see the following plan in {{{}Select Vectorization{}}}:
{noformat}
selectExpressions: 
  IfExprColumnCondExpr(col 12:boolean, col 4:string, col 5:string)(
    children: 
      StringGroupColEqualStringScalar(col 5: string, val AAAAAA)(
        children: 
          StringUpper(col 4:string)(
            children: 
              StringTrimCol(col 1:string) -> 4:string
          ) -> 5:string
      ) -> 12:boolean, 
      
      ConstantVectorExpression(val AAAA_BBBB_CCCC_DDDD) -> 4:string, 
      
      IfExprStringScalarStringScalar(col 13:boolean, val WWWW_XXXX_YYYY_ZZZZ, 
val N/A)(
        children: 
          StringGroupColEqualStringScalar(col 6:string, val BBB)(
            children: 
              StringUpper(col 5:string)(
                children: 
                  StringTrimCol(col 1:string) -> 5:string
              ) -> 6:string
          ) -> 13:boolean
      ) -> 5:string
  ) -> 6:string{noformat}
This is the representation of the CASE WHEN statement:
{noformat}
case
    when upper(trim(col2)) = 'AAAAAA' then 'AAAA_BBBB_CCCC_DDDD'
    when upper(trim(col2)) = 'BBB' then 'WWWW_XXXX_YYYY_ZZZZ'
    else 'N/A'{noformat}
There are three children in the {{selectExpressions}} shown above. The first 
corresponds to the IF expression, i.e., {{{}when upper(trim(col2)) = 
'AAAAAA'{}}}. The second child represents the THEN, and in this case it is 
simply a {{{}ConstantVectorExpression{}}}. And the third child is again a 
conditional (IF) statement that deals with the second WHEN statement in the 
query.

During vectorized execution of the query, after the IF expression (first child) 
has been evaluated it is possible to mark and separate the rows into two 
different vectors corresponding to the THEN and ELSE case. And thus, after the 
second child (THEN expr) has been evaluated, it is possible to produce the 
final output for the rows corresponding the THEN case, without evaluating the 
third child (ELSE expr) first. This is what we do currently, as seen 
[here|https://github.com/apache/hive/blob/46f2783164584bddfef6efdc4d84586bb3114ba1/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/IfExprColumnCondExpr.java#L97].
 The issue arises when we write the partial output of IF-THEN to the final 
output column (col 6 in this case), and then use the same column during the 
evaluation of the ELSE expression (third child), because we can overwrite the 
final results in the column. In this particular example, we can see that col 6 
is used to store the output of {{{}StringUpper{}}}:
{noformat}
              StringUpper(col 5:string)(
                children: 
                  StringTrimCol(col 1:string) -> 5:string
              ) -> 6:string{noformat}
This issue is somewhat similar to 
https://issues.apache.org/jira/browse/HIVE-26408.

I think one of the potential solution is to simply ensure that all children 
have been evaluated before we start writing to the final output column. We see 
that most descendants of {{VectorExpression}} evaluate their children by 
calling the 
[evaluateChildren|https://github.com/apache/hive/blob/46f2783164584bddfef6efdc4d84586bb3114ba1/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorExpression.java#L320]
 method, which evaluates all the children one after another in a for loop. For 
IF-THEN-ELSE statements we do not do this as they may have to be evaluated 
conditionally sometimes. However, we should still ensure that the flow of 
execution is similar and should evaluate all children before writing anything 
to the final output column.

 

> Query giving incorrect results when case is used along with UDFs upper & trim 
> and a sub-query
> ---------------------------------------------------------------------------------------------
>
>                 Key: HIVE-28880
>                 URL: https://issues.apache.org/jira/browse/HIVE-28880
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Taraka Rama Rao Lethavadla
>            Assignee: Soumyakanti Das
>            Priority: Major
>         Attachments: test.q
>
>
> Steps to reproduce:
>  
> {noformat}
> #Issue is not reproducible without below configs
> set hive.vectorized.execution.enabled=true;
> set hive.cbo.enable=true;
> set hive.auto.convert.join=true;
> {noformat}
> Create tables & insert few rows
> {noformat}
> CREATE EXTERNAL TABLE main_tbl(col1 string, col2 string) stored as orc;
> CREATE EXTERNAL TABLE sub_tbl(pdate date) stored as orc;
> insert into main_tbl values('20250331','BBB'),('20250331','AAAAAA');
> insert into sub_tbl values('2025-03-31');
> {noformat}
> The below query returns incorrect results
> {noformat}
> select case
>     when upper(trim(col2)) = 'AAAAAA' then 'AAAA_BBBB_CCCC_DDDD'
>     when upper(trim(col2)) = 'BBB' then 'WWWW_XXXX_YYYY_ZZZZ'
>     else 'N/A'
> end as result
> from main_tbl
> where 
> cast(concat(substr(trim(col1),1,4),'-',substr(trim(col1),5,2),'-',substr(trim(col1),7,2))
>  as date) in (select pdate from sub_tbl);
> Output:
> --------------------
> WWWW_XXXX_YYYY_ZZZZ
> BBBWWWW_XXXX_YYYY_Z{noformat}
> While the expected result is 
>  
> {noformat}
> -------------------------
> WWWW_XXXX_YYYY_ZZZZ
> AAAA_BBBB_CCCC_DDDD
> {noformat}
> One workaround for this is rephrasing the query to 
> {noformat}
> select case
>     when upper(trim(col2)) = 'AAAAAA' then 'AAAA_BBBB_CCCC_DDDD'
>     when upper(trim(col2)) = 'BBB' then 'WWWW_XXXX_YYYY_ZZZZ'
>     else 'N/A'
> end as result
> from main_tbl
> where col1 in (select date_format(pdate, 'yyyyMMdd') from sub_tbl);{noformat}
> Attaching the qtest file here
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to