[
https://issues.apache.org/jira/browse/HIVE-28880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010748#comment-18010748
]
Soumyakanti Das commented on HIVE-28880:
----------------------------------------
To understand why this occurs, we need to run {{{}explain vectorization
detail{}}}. We can see the following plan in {{{}Select Vectorization{}}}:
{noformat}
selectExpressions:
IfExprColumnCondExpr(col 12:boolean, col 4:string, col 5:string)(
children:
StringGroupColEqualStringScalar(col 5: string, val AAAAAA)(
children:
StringUpper(col 4:string)(
children:
StringTrimCol(col 1:string) -> 4:string
) -> 5:string
) -> 12:boolean,
ConstantVectorExpression(val AAAA_BBBB_CCCC_DDDD) -> 4:string,
IfExprStringScalarStringScalar(col 13:boolean, val WWWW_XXXX_YYYY_ZZZZ,
val N/A)(
children:
StringGroupColEqualStringScalar(col 6:string, val BBB)(
children:
StringUpper(col 5:string)(
children:
StringTrimCol(col 1:string) -> 5:string
) -> 6:string
) -> 13:boolean
) -> 5:string
) -> 6:string{noformat}
This is the representation of the CASE WHEN statement:
{noformat}
case
when upper(trim(col2)) = 'AAAAAA' then 'AAAA_BBBB_CCCC_DDDD'
when upper(trim(col2)) = 'BBB' then 'WWWW_XXXX_YYYY_ZZZZ'
else 'N/A'{noformat}
There are three children in the {{selectExpressions}} shown above. The first
corresponds to the IF expression, i.e., {{{}when upper(trim(col2)) =
'AAAAAA'{}}}. The second child represents the THEN, and in this case it is
simply a {{{}ConstantVectorExpression{}}}. And the third child is again a
conditional (IF) statement that deals with the second WHEN statement in the
query.
During vectorized execution of the query, after the IF expression (first child)
has been evaluated it is possible to mark and separate the rows into two
different vectors corresponding to the THEN and ELSE case. And thus, after the
second child (THEN expr) has been evaluated, it is possible to produce the
final output for the rows corresponding the THEN case, without evaluating the
third child (ELSE expr) first. This is what we do currently, as seen
[here|https://github.com/apache/hive/blob/46f2783164584bddfef6efdc4d84586bb3114ba1/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/IfExprColumnCondExpr.java#L97].
The issue arises when we write the partial output of IF-THEN to the final
output column (col 6 in this case), and then use the same column during the
evaluation of the ELSE expression (third child), because we can overwrite the
final results in the column. In this particular example, we can see that col 6
is used to store the output of {{{}StringUpper{}}}:
{noformat}
StringUpper(col 5:string)(
children:
StringTrimCol(col 1:string) -> 5:string
) -> 6:string{noformat}
This issue is somewhat similar to
https://issues.apache.org/jira/browse/HIVE-26408.
I think one of the potential solution is to simply ensure that all children
have been evaluated before we start writing to the final output column. We see
that most descendants of {{VectorExpression}} evaluate their children by
calling the
[evaluateChildren|https://github.com/apache/hive/blob/46f2783164584bddfef6efdc4d84586bb3114ba1/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorExpression.java#L320]
method, which evaluates all the children one after another in a for loop. For
IF-THEN-ELSE statements we do not do this as they may have to be evaluated
conditionally sometimes. However, we should still ensure that the flow of
execution is similar and should evaluate all children before writing anything
to the final output column.
> Query giving incorrect results when case is used along with UDFs upper & trim
> and a sub-query
> ---------------------------------------------------------------------------------------------
>
> Key: HIVE-28880
> URL: https://issues.apache.org/jira/browse/HIVE-28880
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2
> Reporter: Taraka Rama Rao Lethavadla
> Assignee: Soumyakanti Das
> Priority: Major
> Attachments: test.q
>
>
> Steps to reproduce:
>
> {noformat}
> #Issue is not reproducible without below configs
> set hive.vectorized.execution.enabled=true;
> set hive.cbo.enable=true;
> set hive.auto.convert.join=true;
> {noformat}
> Create tables & insert few rows
> {noformat}
> CREATE EXTERNAL TABLE main_tbl(col1 string, col2 string) stored as orc;
> CREATE EXTERNAL TABLE sub_tbl(pdate date) stored as orc;
> insert into main_tbl values('20250331','BBB'),('20250331','AAAAAA');
> insert into sub_tbl values('2025-03-31');
> {noformat}
> The below query returns incorrect results
> {noformat}
> select case
> when upper(trim(col2)) = 'AAAAAA' then 'AAAA_BBBB_CCCC_DDDD'
> when upper(trim(col2)) = 'BBB' then 'WWWW_XXXX_YYYY_ZZZZ'
> else 'N/A'
> end as result
> from main_tbl
> where
> cast(concat(substr(trim(col1),1,4),'-',substr(trim(col1),5,2),'-',substr(trim(col1),7,2))
> as date) in (select pdate from sub_tbl);
> Output:
> --------------------
> WWWW_XXXX_YYYY_ZZZZ
> BBBWWWW_XXXX_YYYY_Z{noformat}
> While the expected result is
>
> {noformat}
> -------------------------
> WWWW_XXXX_YYYY_ZZZZ
> AAAA_BBBB_CCCC_DDDD
> {noformat}
> One workaround for this is rephrasing the query to
> {noformat}
> select case
> when upper(trim(col2)) = 'AAAAAA' then 'AAAA_BBBB_CCCC_DDDD'
> when upper(trim(col2)) = 'BBB' then 'WWWW_XXXX_YYYY_ZZZZ'
> else 'N/A'
> end as result
> from main_tbl
> where col1 in (select date_format(pdate, 'yyyyMMdd') from sub_tbl);{noformat}
> Attaching the qtest file here
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)