[
https://issues.apache.org/jira/browse/HIVE-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Bereznyakov updated HIVE-29084:
------------------------------------------
Description:
When CBO is enabled, queries with multiple LATERAL VIEW explode operations and
WHERE clause filtering return incorrect results. WHERE clause conditions could
be dropped during post-Calcite PPD processing, causing the query to return rows
that should be filtered out.
I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW
explode()" *and* a WHERE clause; it does not seem to happen with only one
"LATERAL VIEW explode()" or without a WHERE clause.
Originally reproduced with published 4.0.1 container, then verified with
locally built current master branch.
I was *not* able to reproduce it with unit tests.
{{*Minimal Reproduction Case:*}}
{{ }}
{{ DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}}
{{ CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}}
{{ INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}}
{{{} ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;{}}}{\{ **
}}
{{ *-- Correct result (CBO OFF)*}}
{{ SET hive.cbo.enable=false;}}
{{ SELECT first_val, second_val}}
{{ FROM cbo_lateral_view_where_bug}}
{{ LATERAL VIEW explode(values) lv1 AS first_val}}
{{ LATERAL VIEW explode(values) lv2 AS second_val}}
{{ WHERE first_val != second_val;}}{{ *-- Incorrect result (CBO ON)*}}
{{ SET hive.cbo.enable=true;}}
{{ SELECT first_val, second_val}}
{{ FROM cbo_lateral_view_where_bug}}
{{ LATERAL VIEW explode(values) lv1 AS first_val}}
{{ LATERAL VIEW explode(values) lv2 AS second_val}}
{{{} WHERE first_val != second_val;{}}}{\{ ** }}
{{ *–- Expected vs Actual Results:*}}
{{ - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}}
{{ - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}}
*Root Cause:*
ASTConverter's LV processing assigns the same tableAlias value to all columns
of the query, even ones generated as LATERAL VIEW. This causes certain
conditions (OR, "!=") between columns of different LV clauses (and even the LV
clause and the base table) to be "incorrectly pushable" during PPD processing
*Impact:*
- Data Correctness: Queries return wrong results, not just performance issues
- Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering
- Severity: Silent data corruption - users may not notice incorrect results
{*}Workaround{*}:
Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view
cross-products and WHERE clause filtering. Disabling PPD also seems to bypass
the problem.
Attachments:
- [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce
- Hive log: [^hive_cbo_lateral_view_where_bug.log]
was:
When CBO is enabled, queries with multiple LATERAL VIEW explode operations and
WHERE clause filtering return incorrect results. WHERE clause conditions could
be dropped during post-Calcite PPD processing, causing the query to return rows
that should be filtered out.
I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW
explode()" *and* a WHERE clause; it does not seem to happen with only one
"LATERAL VIEW explode()" or without a WHERE clause.
Originally reproduced with published 4.0.1 container, then verified with
locally built current master branch.
I was *not* able to reproduce it with unit tests.
{{*Minimal Reproduction Case:*}}
{{ }}
{{ DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}}
{{ CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}}
{{ INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}}
{{{} ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;{}}}{\{ **
}}
{{ *-- Correct result (CBO OFF)*}}
{{ SET hive.cbo.enable=false;}}
{{ SELECT first_val, second_val}}
{{ FROM cbo_lateral_view_where_bug}}
{{ LATERAL VIEW explode(values) lv1 AS first_val}}
{{ LATERAL VIEW explode(values) lv2 AS second_val}}
{{ WHERE first_val != second_val;}}{{ *-- Incorrect result (CBO ON)*}}
{{ SET hive.cbo.enable=true;}}
{{ SELECT first_val, second_val}}
{{ FROM cbo_lateral_view_where_bug}}
{{ LATERAL VIEW explode(values) lv1 AS first_val}}
{{ LATERAL VIEW explode(values) lv2 AS second_val}}
{{{} WHERE first_val != second_val;{}}}{\{ ** }}
{{ *–- Expected vs Actual Results:*}}
{{ - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}}
{{ - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}}
*Root Cause:*
ASTConverter's LV processing assigns the same tableAlias value to all columns
of the query, even ones generated as LATERAL VIEW. This causes certain
conditions (OR, "!=") between columns of different LV statements to be
"incorrectly pushable" during PPD processing
*Impact:*
- Data Correctness: Queries return wrong results, not just performance issues
- Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering
- Severity: Silent data corruption - users may not notice incorrect results
{*}Workaround{*}:
Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view
cross-products and WHERE clause filtering. Disabling PPD also seems to bypass
the problem.
Attachments:
- [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce
- Hive log: [^hive_cbo_lateral_view_where_bug.log]
> CBO/PPD cause WHERE filter "loss" for certain LV queries
> --------------------------------------------------------
>
> Key: HIVE-29084
> URL: https://issues.apache.org/jira/browse/HIVE-29084
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Affects Versions: 4.1.0, 4.0.1, 4.2.0
> Environment: - Affected Versions: Hive 4.0.1, 4.2.0-SNAPSHOT
> (likely all CBO+lateral view versions)
> - Execution Engine: MapReduce (local mode)
> - Metastore: Derby embedded
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-29084-ASTConverter.patch,
> HIVE-29084-after-fix-filter-schema-different-table-aliases.png,
> HIVE-29084-before-fix-filter-schema-different-table-aliases.png,
> hive-29084.POC.patch, hive_29084.q, hive_29084.q.out,
> hive_cbo_lateral_view_where_bug.log, hive_cbo_lateral_view_where_bug.sql
>
>
> When CBO is enabled, queries with multiple LATERAL VIEW explode operations
> and WHERE clause filtering return incorrect results. WHERE clause conditions
> could be dropped during post-Calcite PPD processing, causing the query to
> return rows that should be filtered out.
> I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW
> explode()" *and* a WHERE clause; it does not seem to happen with only one
> "LATERAL VIEW explode()" or without a WHERE clause.
> Originally reproduced with published 4.0.1 container, then verified with
> locally built current master branch.
> I was *not* able to reproduce it with unit tests.
> {{*Minimal Reproduction Case:*}}
> {{ }}
> {{ DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}}
> {{ CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}}
> {{ INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}}
> {{{} ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;{}}}{\{ **
> }}
> {{ *-- Correct result (CBO OFF)*}}
> {{ SET hive.cbo.enable=false;}}
> {{ SELECT first_val, second_val}}
> {{ FROM cbo_lateral_view_where_bug}}
> {{ LATERAL VIEW explode(values) lv1 AS first_val}}
> {{ LATERAL VIEW explode(values) lv2 AS second_val}}
> {{ WHERE first_val != second_val;}}{{ *-- Incorrect result (CBO ON)*}}
> {{ SET hive.cbo.enable=true;}}
> {{ SELECT first_val, second_val}}
> {{ FROM cbo_lateral_view_where_bug}}
> {{ LATERAL VIEW explode(values) lv1 AS first_val}}
> {{ LATERAL VIEW explode(values) lv2 AS second_val}}
> {{{} WHERE first_val != second_val;{}}}{\{ ** }}
> {{ *–- Expected vs Actual Results:*}}
> {{ - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}}
> {{ - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}}
> *Root Cause:*
> ASTConverter's LV processing assigns the same tableAlias value to all
> columns of the query, even ones generated as LATERAL VIEW. This causes
> certain conditions (OR, "!=") between columns of different LV clauses (and
> even the LV clause and the base table) to be "incorrectly pushable" during
> PPD processing
> *Impact:*
> - Data Correctness: Queries return wrong results, not just performance
> issues
> - Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering
> - Severity: Silent data corruption - users may not notice incorrect results
> {*}Workaround{*}:
> Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view
> cross-products and WHERE clause filtering. Disabling PPD also seems to bypass
> the problem.
>
> Attachments:
> - [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce
> - Hive log: [^hive_cbo_lateral_view_where_bug.log]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)