[ 
https://issues.apache.org/jira/browse/HIVE-29084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29084:
------------------------------------------
    Description: 
When CBO is enabled, queries with multiple LATERAL VIEW explode operations and 
WHERE clause filtering return incorrect results. WHERE clause conditions could 
be dropped during post-Calcite PPD processing, causing the query to return rows 
that should be filtered out.

I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW 
explode()" *and* a WHERE clause; it does not seem to happen with only one 
"LATERAL VIEW explode()" or without a WHERE clause.

Originally reproduced with published 4.0.1 container, then verified with 
locally built current master branch.

I was *not* able to reproduce it with unit tests.

{{*Minimal Reproduction Case:*}}
{{  }}
{{  DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}}
{{  CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}}
{{  INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}}
{{{}  ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;{}}}{\{  **  
}}

{{  *-- Correct result (CBO OFF)*}}
{{  SET hive.cbo.enable=false;}}
{{  SELECT first_val, second_val}}
{{  FROM cbo_lateral_view_where_bug}}
{{  LATERAL VIEW explode(values) lv1 AS first_val}}
{{  LATERAL VIEW explode(values) lv2 AS second_val}}
{{  WHERE first_val != second_val;}}{{  *-- Incorrect result (CBO ON)*}}
{{  SET hive.cbo.enable=true;}}
{{  SELECT first_val, second_val}}
{{  FROM cbo_lateral_view_where_bug}}
{{  LATERAL VIEW explode(values) lv1 AS first_val}}
{{  LATERAL VIEW explode(values) lv2 AS second_val}}
{{{}  WHERE first_val != second_val;{}}}{\{  **  }}

{{  *–- Expected vs Actual Results:*}}
{{  - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}}
{{  - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}}

  *Root Cause:*
  ASTConverter's LV processing assigns the same tableAlias value to all columns 
of the query, even ones generated as LATERAL VIEW. This causes certain 
conditions (OR, "!=") between columns of different LV statements to be 
"incorrectly pushable" during PPD processing

  *Impact:*
  - Data Correctness: Queries return wrong results, not just performance issues
  - Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering
  - Severity: Silent data corruption - users may not notice incorrect results

  {*}Workaround{*}:
Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view 
cross-products and WHERE clause filtering. Disabling PPD also seems to bypass 
the problem.

 

Attachments:
 - [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce
 - Hive log: [^hive_cbo_lateral_view_where_bug.log]

  was:
When CBO is enabled, queries with multiple LATERAL VIEW explode operations and 
WHERE clause filtering return incorrect results. The WHERE clause selectivity 
is not properly applied during optimization, causing the query to return rows 
that should be filtered out.

I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW 
explode()" *and* a WHERE clause; it does not seem to happen with only one 
"LATERAL VIEW explode()" or without a WHERE clause.

Originally reproduced with published 4.0.1 container, then verified with 
locally built current master branch.

I was *not* able to reproduce it with unit tests.



{{*Minimal Reproduction Case:*}}
{{  }}
{{  DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}}
{{  CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}}
{{  INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}}
{{  ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;}}{{  **  }}

{{  *-- Correct result (CBO OFF)*}}
{{  SET hive.cbo.enable=false;}}
{{  SELECT first_val, second_val}}
{{  FROM cbo_lateral_view_where_bug}}
{{  LATERAL VIEW explode(values) lv1 AS first_val}}
{{  LATERAL VIEW explode(values) lv2 AS second_val}}
{{  WHERE first_val != second_val;}}{{  *-- Incorrect result (CBO ON)*}}
{{  SET hive.cbo.enable=true;}}
{{  SELECT first_val, second_val}}
{{  FROM cbo_lateral_view_where_bug}}
{{  LATERAL VIEW explode(values) lv1 AS first_val}}
{{  LATERAL VIEW explode(values) lv2 AS second_val}}
{{  WHERE first_val != second_val;}}{{  **  }}

{{  *–- Expected vs Actual Results:*}}
{{  - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}}
{{  - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}}

  *Root Cause:*
  CBO fails to properly estimate the selectivity of WHERE clauses that filter 
cross-products created by multiple lateral views on the same array. The 
optimizer appears to ignore or miscalculate the filtering effect, allowing rows 
that should be excluded.

  *Impact:*
  - Data Correctness: Queries return wrong results, not just performance issues
  - Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering
  - Severity: Silent data corruption - users may not notice incorrect results

  {*}Workaround{*}:
Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view 
cross-products and WHERE clause filtering.

 

Attachments: 
- [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce
- Hive log: [^hive_cbo_lateral_view_where_bug.log]


> CBO returns incorrect results when WHERE clause filters multiple LATERAL VIEW 
> explode cross-products
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29084
>                 URL: https://issues.apache.org/jira/browse/HIVE-29084
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 4.1.0, 4.0.1, 4.2.0
>         Environment:   - Affected Versions: Hive 4.0.1, 4.2.0-SNAPSHOT 
> (likely all CBO+lateral view versions)
>   - Execution Engine: MapReduce (local mode)
>   - Metastore: Derby embedded
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-29084-ASTConverter.patch, 
> HIVE-29084-after-fix-filter-schema-different-table-aliases.png, 
> HIVE-29084-before-fix-filter-schema-different-table-aliases.png, 
> hive-29084.POC.patch, hive_29084.q, hive_29084.q.out, 
> hive_cbo_lateral_view_where_bug.log, hive_cbo_lateral_view_where_bug.sql
>
>
> When CBO is enabled, queries with multiple LATERAL VIEW explode operations 
> and WHERE clause filtering return incorrect results. WHERE clause conditions 
> could be dropped during post-Calcite PPD processing, causing the query to 
> return rows that should be filtered out.
> I can reliably reproduce the bug with 2+ occurrences of "LATERAL VIEW 
> explode()" *and* a WHERE clause; it does not seem to happen with only one 
> "LATERAL VIEW explode()" or without a WHERE clause.
> Originally reproduced with published 4.0.1 container, then verified with 
> locally built current master branch.
> I was *not* able to reproduce it with unit tests.
> {{*Minimal Reproduction Case:*}}
> {{  }}
> {{  DROP TABLE IF EXISTS cbo_lateral_view_where_bug;}}
> {{  CREATE TABLE cbo_lateral_view_where_bug (values array<string>);}}
> {{  INSERT INTO cbo_lateral_view_where_bug SELECT array('a', 'b');}}
> {{{}  ANALYZE TABLE cbo_lateral_view_where_bug COMPUTE STATISTICS;{}}}{\{  ** 
>  }}
> {{  *-- Correct result (CBO OFF)*}}
> {{  SET hive.cbo.enable=false;}}
> {{  SELECT first_val, second_val}}
> {{  FROM cbo_lateral_view_where_bug}}
> {{  LATERAL VIEW explode(values) lv1 AS first_val}}
> {{  LATERAL VIEW explode(values) lv2 AS second_val}}
> {{  WHERE first_val != second_val;}}{{  *-- Incorrect result (CBO ON)*}}
> {{  SET hive.cbo.enable=true;}}
> {{  SELECT first_val, second_val}}
> {{  FROM cbo_lateral_view_where_bug}}
> {{  LATERAL VIEW explode(values) lv1 AS first_val}}
> {{  LATERAL VIEW explode(values) lv2 AS second_val}}
> {{{}  WHERE first_val != second_val;{}}}{\{  **  }}
> {{  *–- Expected vs Actual Results:*}}
> {{  - CBO OFF (correct): 2 rows: ('a','b') and ('b','a')}}
> {{  - CBO ON (incorrect): 4 rows: ('a','b'), ('b','a'), ('a','a'), ('b','b')}}
>   *Root Cause:*
>   ASTConverter's LV processing assigns the same tableAlias value to all 
> columns of the query, even ones generated as LATERAL VIEW. This causes 
> certain conditions (OR, "!=") between columns of different LV statements to 
> be "incorrectly pushable" during PPD processing
>   *Impact:*
>   - Data Correctness: Queries return wrong results, not just performance 
> issues
>   - Scope: Requires 2+ LATERAL VIEW explode operations with WHERE filtering
>   - Severity: Silent data corruption - users may not notice incorrect results
>   {*}Workaround{*}:
> Set {*}hive.cbo.enable{*}=false for queries with multiple lateral view 
> cross-products and WHERE clause filtering. Disabling PPD also seems to bypass 
> the problem.
>  
> Attachments:
>  - [^hive_cbo_lateral_view_where_bug.sql] - SQL to reproduce
>  - Hive log: [^hive_cbo_lateral_view_where_bug.log]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to