[
https://issues.apache.org/jira/browse/HIVE-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18007343#comment-18007343
]
Konstantin Bereznyakov commented on HIVE-29084:
-----------------------------------------------
uzing Tez, EXPLAN EXTENDED with CBO ON/OFF:
{code}
SET hive.cbo.enable=true;
No rows affected (0.007 seconds)
0: jdbc:hive2://localhost:10000> SET hive.cbo.enable;
+-----------------------+
| set |
+-----------------------+
| hive.cbo.enable=true |
+-----------------------+
1 row selected (0.01 seconds)
0: jdbc:hive2://localhost:10000> EXPLAIN EXTENDED SELECT first_val, second_val
. . . . . . . . . . . . . . . .> FROM (SELECT array('a', 'b') as val_array)
inline_data
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv1 AS
first_val
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv2 AS
second_val
. . . . . . . . . . . . . . . .> WHERE first_val != second_val;
INFO : Compiling
command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b):
EXPLAIN EXTENDED SELECT first_val, second_val
FROM (SELECT array('a', 'b') as val_array) inline_data
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain,
type:string, comment:null)], properties:null)
INFO : Completed compiling
command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b); Time
taken: 0.216 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing
command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b):
EXPLAIN EXTENDED SELECT first_val, second_val
FROM (SELECT array('a', 'b') as val_array) inline_data
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
INFO : Starting task [Stage-1:EXPLAIN] in serial mode
INFO : Completed executing
command(queryId=root_20250715220200_00940b48-adba-4296-ae50-75b5e92a0c7b); Time
taken: 0.031 seconds
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-0 is a root stage |
| |
| STAGE PLANS: |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| TableScan |
| alias: _dummy_table |
| Row Limit Per Split: 1 |
| GatherStats: false |
| Select Operator |
| expressions: array('a','b') (type: array<string>) |
| outputColumnNames: _col0 |
| Lateral View Forward |
| Select Operator |
| expressions: _col0 (type: array<string>) |
| outputColumnNames: _col0 |
| Lateral View Join Operator |
| outputColumnNames: _col0, _col1 |
| Lateral View Forward |
| Select Operator |
| expressions: _col1 (type: string) |
| outputColumnNames: _col1 |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Select Operator |
| expressions: _col1 (type: string), _col2 (type:
string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| Select Operator |
| expressions: _col0 (type: array<string>) |
| outputColumnNames: _col0 |
| UDTF Operator |
| function name: explode |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Select Operator |
| expressions: _col1 (type: string), _col2 (type:
string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| Select Operator |
| expressions: Const array<string> [a, b] (type: array<string>)
|
| outputColumnNames: _col0 |
| UDTF Operator |
| function name: explode |
| Lateral View Join Operator |
| outputColumnNames: _col0, _col1 |
| Lateral View Forward |
| Select Operator |
| expressions: _col1 (type: string) |
| outputColumnNames: _col1 |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Select Operator |
| expressions: _col1 (type: string), _col2 (type:
string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| Select Operator |
| expressions: _col0 (type: array<string>) |
| outputColumnNames: _col0 |
| UDTF Operator |
| function name: explode |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Select Operator |
| expressions: _col1 (type: string), _col2 (type:
string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| |
+----------------------------------------------------+
71 rows selected (0.265 seconds)
0: jdbc:hive2://localhost:10000> SET hive.cbo.enable=false;
No rows affected (0.005 seconds)
0: jdbc:hive2://localhost:10000> SET hive.cbo.enable;
+------------------------+
| set |
+------------------------+
| hive.cbo.enable=false |
+------------------------+
1 row selected (0.007 seconds)
0: jdbc:hive2://localhost:10000> EXPLAIN EXTENDED SELECT first_val, second_val
. . . . . . . . . . . . . . . .> FROM (SELECT array('a', 'b') as val_array)
inline_data
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv1 AS
first_val
. . . . . . . . . . . . . . . .> LATERAL VIEW explode(val_array) lv2 AS
second_val
. . . . . . . . . . . . . . . .> WHERE first_val != second_val;
INFO : Compiling
command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80):
EXPLAIN EXTENDED SELECT first_val, second_val
FROM (SELECT array('a', 'b') as val_array) inline_data
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
INFO : Semantic Analysis Completed (retrial = false)
INFO : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain,
type:string, comment:null)], properties:null)
INFO : Completed compiling
command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80); Time
taken: 0.095 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing
command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80):
EXPLAIN EXTENDED SELECT first_val, second_val
FROM (SELECT array('a', 'b') as val_array) inline_data
LATERAL VIEW explode(val_array) lv1 AS first_val
LATERAL VIEW explode(val_array) lv2 AS second_val
WHERE first_val != second_val
INFO : Starting task [Stage-1:EXPLAIN] in serial mode
INFO : Completed executing
command(queryId=root_20250715220220_a4c8999e-6aa2-4546-8aa0-d5c5128a5d80); Time
taken: 0.034 seconds
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| STAGE DEPENDENCIES: |
| Stage-0 is a root stage |
| |
| STAGE PLANS: |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| TableScan |
| alias: _dummy_table |
| Row Limit Per Split: 1 |
| GatherStats: false |
| Select Operator |
| expressions: array('a','b') (type: array<string>) |
| outputColumnNames: _col0 |
| Lateral View Forward |
| Select Operator |
| expressions: _col0 (type: array<string>) |
| outputColumnNames: _col0 |
| Lateral View Join Operator |
| outputColumnNames: _col0, _col1 |
| Lateral View Forward |
| Select Operator |
| expressions: _col1 (type: string) |
| outputColumnNames: _col1 |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Filter Operator |
| isSamplingPred: false |
| predicate: (_col1 <> _col2) (type: boolean) |
| Select Operator |
| expressions: _col1 (type: string), _col2 (type:
string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| Select Operator |
| expressions: _col0 (type: array<string>) |
| outputColumnNames: _col0 |
| UDTF Operator |
| function name: explode |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Filter Operator |
| isSamplingPred: false |
| predicate: (_col1 <> _col2) (type: boolean) |
| Select Operator |
| expressions: _col1 (type: string), _col2 (type:
string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| Select Operator |
| expressions: Const array<string> [a, b] (type: array<string>)
|
| outputColumnNames: _col0 |
| UDTF Operator |
| function name: explode |
| Lateral View Join Operator |
| outputColumnNames: _col0, _col1 |
| Lateral View Forward |
| Select Operator |
| expressions: _col1 (type: string) |
| outputColumnNames: _col1 |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Filter Operator |
| isSamplingPred: false |
| predicate: (_col1 <> _col2) (type: boolean) |
| Select Operator |
| expressions: _col1 (type: string), _col2 (type:
string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| Select Operator |
| expressions: _col0 (type: array<string>) |
| outputColumnNames: _col0 |
| UDTF Operator |
| function name: explode |
| Lateral View Join Operator |
| outputColumnNames: _col1, _col2 |
| Filter Operator |
| isSamplingPred: false |
| predicate: (_col1 <> _col2) (type: boolean) |
| Select Operator |
| expressions: _col1 (type: string), _col2
(type: string) |
| outputColumnNames: _col0, _col1 |
| ListSink |
| |
+----------------------------------------------------+
83 rows selected (0.145 seconds){code}
> CBO returns incorrect results when WHERE clause filters multiple LATERAL VIEW
> explode cross-products
> ----------------------------------------------------------------------------------------------------
>
> Key: HIVE-29084
> URL: https://issues.apache.org/jira/browse/HIVE-29084
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Affects Versions: 4.1.0, 4.0.1
> Environment: - Affected Versions: Hive 4.0.1, 4.2.0-SNAPSHOT
> (likely all CBO+lateral view versions)
> - Execution Engine: MapReduce (local mode)
> - Metastore: Derby embedded
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Attachments: hive_cbo_lateral_view_where_bug.log,
> hive_cbo_lateral_view_where_bug.sql
>
>
> When CBO is enabled, queries with multiple LATERAL VIEW explode operations
> and WHERE clause filtering return incorrect results. The WHERE clause
> selectivity is not properly applied during optimization, causing the query to
> return rows that should be filtered out.
> I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW
> explode()" *and* a WHERE clause; it does not seem to happen with only one
> "LATERAL VIEW explode()" or without a WHERE clause.
> Originally reproduced with published 4.0.1 container, then verified with
> locally built current master branch.
> I was *not* able to reproduce it with unit tests.
> {{*Minimal Reproduction Case:*}}
> {{ }}
> {{ DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}}
> {{ CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}}
> {{ INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}}
> {{ ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;}}{{ ** }}
> {{ *-- Correct result (CBO OFF)*}}
> {{ SET hive.cbo.enable=false;}}
> {{ SELECT first_val, second_val}}
> {{ FROM cbo_lateral_view_where_bug}}
> {{ LATERAL VIEW explode(values) lv1 AS first_val}}
> {{ LATERAL VIEW explode(values) lv2 AS second_val}}
> {{ WHERE first_val != second_val;}}{{ *-- Incorrect result (CBO ON)*}}
> {{ SET hive.cbo.enable=true;}}
> {{ SELECT first_val, second_val}}
> {{ FROM cbo_lateral_view_where_bug}}
> {{ LATERAL VIEW explode(values) lv1 AS first_val}}
> {{ LATERAL VIEW explode(values) lv2 AS second_val}}
> {{ WHERE first_val != second_val;}}{{ ** }}
> {{ *–- Expected vs Actual Results:*}}
> {{ - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}}
> {{ - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}}
> *Root Cause:*
> CBO fails to properly estimate the selectivity of WHERE clauses that filter
> cross-products created by multiple lateral views on the same array. The
> optimizer appears to ignore or miscalculate the filtering effect, allowing
> rows that should be excluded.
> *Impact:*
> - Data Correctness: Queries return wrong results, not just performance
> issues
> - Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering
> - Severity: Silent data corruption - users may not notice incorrect results
> {*}Workaround{*}:
> Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view
> cross-products and WHERE clause filtering.
>
> Attachments:
> - [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce
> - Hive log: [^hive_cbo_lateral_view_where_bug.log]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)