[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=763340&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763340
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 08:01
Start Date: 28/Apr/22 08:01
Worklog Time Spent: 10m 
  Work Description: pvary merged PR #3252:
URL: https://github.com/apache/hive/pull/3252




Issue Time Tracking
---

Worklog Id: (was: 763340)
Time Spent: 4h 10m  (was: 4h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=763335&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-763335
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 28/Apr/22 07:40
Start Date: 28/Apr/22 07:40
Worklog Time Spent: 10m 
  Work Description: asolimando opened a new pull request, #3252:
URL: https://github.com/apache/hive/pull/3252

   Fixing broken javadoc
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 763335)
Time Spent: 4h  (was: 3h 50m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=762756&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762756
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 27/Apr/22 08:45
Start Date: 27/Apr/22 08:45
Worklog Time Spent: 10m 
  Work Description: zabetak closed pull request #2966: HIVE-25758: OOM due 
to recursive application of CBO rules
URL: https://github.com/apache/hive/pull/2966




Issue Time Tracking
---

Worklog Id: (was: 762756)
Time Spent: 3h 50m  (was: 3h 40m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=762432&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762432
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 26/Apr/22 17:13
Start Date: 26/Apr/22 17:13
Worklog Time Spent: 10m 
  Work Description: asolimando commented on PR #2966:
URL: https://github.com/apache/hive/pull/2966#issuecomment-1110049611

   Tests are finally green, is there anything else that is needed @zabetak?




Issue Time Tracking
---

Worklog Id: (was: 762432)
Time Spent: 3h 40m  (was: 3.5h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761857&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761857
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 15:29
Start Date: 25/Apr/22 15:29
Worklog Time Spent: 10m 
  Work Description: asolimando commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857758546


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##
@@ -1214,6 +1214,58 @@ public FixNullabilityShuttle(RexBuilder rexBuilder,
 }
   }
 
+  /**
+   * Find disjunction (OR) in an expression (at any level of nesting).
+   *
+   * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level 
expression)
+   * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+   *   this is equivalent to OR((<>($0, $1), IS NULL($2))
+   * Example 3: AND(OR(=($0, $1), IS NOT NULL($2 (OR in inner expression)
+   */

Review Comment:
   > I had in mind compiling with -Pjavadoc profile and checking for new errors 
in this class. Actually I am afraid of <> symbols as well as $. Don't remember 
if it is fine to use them like that.
   
   Indeed '<' and '>' are illegal, the rest did not highlight any issue when 
compiling with `-Pjavadoc`. I took the chance to improve the formatting of the 
list of examples. 





Issue Time Tracking
---

Worklog Id: (was: 761857)
Time Spent: 3.5h  (was: 3h 20m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761845&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761845
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 15:16
Start Date: 25/Apr/22 15:16
Worklog Time Spent: 10m 
  Work Description: zabetak commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857739177


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##
@@ -1257,15 +1249,26 @@ public Void visitCall(RexCall call) {
 }
 
 public boolean hasDisjunction(RexNode node) {
-  // clear the state
-  inNegation = false;
-  hasDisjunction = false;
-
   node.accept(this);
   return hasDisjunction;
 }
   }
 
+  /**
+   * Find disjunction (OR) in an expression (at any level of nesting).

Review Comment:
   `Find` implies that we are going to return a disjunctive expression. 
Suggestion:
   `Returns whether the expression has disjunctions (OR) at any level of 
nesting.`



##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##
@@ -1257,15 +1249,26 @@ public Void visitCall(RexCall call) {
 }
 
 public boolean hasDisjunction(RexNode node) {
-  // clear the state
-  inNegation = false;
-  hasDisjunction = false;
-
   node.accept(this);
   return hasDisjunction;
 }
   }
 
+  /**
+   * Find disjunction (OR) in an expression (at any level of nesting).
+   *
+   * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level 
expression)
+   * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+   *   this is equivalent to OR((<>($0, $1), IS NULL($2))
+   * Example 3: AND(OR(=($0, $1), IS NOT NULL($2 (OR in inner expression)
+   *
+   * @param node the expression where to look for disjunctions.
+   * @return true if the given expressions contains a disjunction, false 
otherwise.
+   */
+  public static boolean hasDisjuction(RexNode node) {
+return new DisjunctivePredicatesFinder().hasDisjunction(node);

Review Comment:
   nit to get rid of the `DisjunctivePredicateFinder#hasDisjunction` method:
   ```
   DisjunctivePredicateFinder finder = new DisjunctivePredicateFinder();
   node.accept(finder);
   return finder.hasDisjunction;
   ```



##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##
@@ -1214,6 +1214,58 @@ public FixNullabilityShuttle(RexBuilder rexBuilder,
 }
   }
 
+  /**
+   * Find disjunction (OR) in an expression (at any level of nesting).
+   *
+   * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level 
expression)
+   * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+   *   this is equivalent to OR((<>($0, $1), IS NULL($2))
+   * Example 3: AND(OR(=($0, $1), IS NOT NULL($2 (OR in inner expression)
+   */

Review Comment:
   I had in mind compiling with `-Pjavadoc` profile and checking for new errors 
in this class. Actually I am afraid of `<>` symbols as well as `$`. Don't 
remember if it is fine to use them like that.





Issue Time Tracking
---

Worklog Id: (was: 761845)
Time Spent: 3h 20m  (was: 3h 10m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(21

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761831
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 14:50
Start Date: 25/Apr/22 14:50
Worklog Time Spent: 10m 
  Work Description: asolimando commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857716612


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##
@@ -1214,6 +1214,58 @@ public FixNullabilityShuttle(RexBuilder rexBuilder,
 }
   }
 
+  /**
+   * Find disjunction (OR) in an expression (at any level of nesting).
+   *
+   * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level 
expression)
+   * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+   *   this is equivalent to OR((<>($0, $1), IS NULL($2))
+   * Example 3: AND(OR(=($0, $1), IS NOT NULL($2 (OR in inner expression)
+   */

Review Comment:
   I get no errors while compiling (without `-Dmaven.javadoc.skip` of course).
   
   Is there anything specific you are referring to?





Issue Time Tracking
---

Worklog Id: (was: 761831)
Time Spent: 3h 10m  (was: 3h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761816&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761816
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 14:18
Start Date: 25/Apr/22 14:18
Worklog Time Spent: 10m 
  Work Description: zabetak commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857679191


##
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:
##
@@ -2547,6 +2547,9 @@ public static enum ConfVars {
 "If this config is true only pushed down filters remain in the 
operator tree, \n" +
 "and the original filter is removed. If this config is false, the 
original filter \n" +
 "is also left in the operator tree at the original place."),
+
HIVE_JOIN_DISJ_TRANSITIVE_PREDICATES_PUSHDOWN("hive.optimize.join.disjunctive.transitive.predicates.pushdown",
+true, "Whether to transitively infer disjunctive predicates across 
joins. \n"
++ "Disjunctive predicates can lead to OOM in transitive 
inference."),

Review Comment:
   Suggestion: Disjunctive predicates are hard to simplify and pushing them 
down may in some cases lead to infinite rule matching causing stackoverflow and 
OOM errors.





Issue Time Tracking
---

Worklog Id: (was: 761816)
Time Spent: 3h  (was: 2h 50m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761815&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761815
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 14:14
Start Date: 25/Apr/22 14:14
Worklog Time Spent: 10m 
  Work Description: zabetak commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857675116


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##
@@ -1214,6 +1214,58 @@ public FixNullabilityShuttle(RexBuilder rexBuilder,
 }
   }
 
+  /**
+   * Find disjunction (OR) in an expression (at any level of nesting).
+   *
+   * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level 
expression)
+   * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+   *   this is equivalent to OR((<>($0, $1), IS NULL($2))
+   * Example 3: AND(OR(=($0, $1), IS NOT NULL($2 (OR in inner expression)
+   */

Review Comment:
   Please ensure that we do not have illegal characters in the javadoc; i.e., 
we don't generate new javadoc errors.





Issue Time Tracking
---

Worklog Id: (was: 761815)
Time Spent: 2h 50m  (was: 2h 40m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761813
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 14:12
Start Date: 25/Apr/22 14:12
Worklog Time Spent: 10m 
  Work Description: zabetak commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857673196


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java:
##
@@ -1214,6 +1214,58 @@ public FixNullabilityShuttle(RexBuilder rexBuilder,
 }
   }
 
+  /**
+   * Find disjunction (OR) in an expression (at any level of nesting).
+   *
+   * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level 
expression)
+   * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+   *   this is equivalent to OR((<>($0, $1), IS NULL($2))
+   * Example 3: AND(OR(=($0, $1), IS NOT NULL($2 (OR in inner expression)
+   */
+  public static class DisjunctivePredicatesFinder extends RexVisitorImpl 
{

Review Comment:
   How about making this class private and expose only a method? `public static 
hasDisjunction`?





Issue Time Tracking
---

Worklog Id: (was: 761813)
Time Spent: 2h 40m  (was: 2.5h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761778&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761778
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 13:02
Start Date: 25/Apr/22 13:02
Worklog Time Spent: 10m 
  Work Description: asolimando commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857601445


##
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:
##
@@ -2547,6 +2547,9 @@ public static enum ConfVars {
 "If this config is true only pushed down filters remain in the 
operator tree, \n" +
 "and the original filter is removed. If this config is false, the 
original filter \n" +
 "is also left in the operator tree at the original place."),
+
HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE("hive.optimize.join.transitive.predicates.conservative",
+false, "Whether to avoid pushing predicates that are hard to simplify. 
\n"

Review Comment:
   Since we are focusing on the disjunctive predicates (as per another 
comment), let's opt for 
`hive.optimize.join.disjunctive.transitive.predicates.pushdown`.
   
   As discussed offline, let's leave the property disabled by default, because 
it's hard to judge how much benefit/harm there will be for Hive in the wild, we 
can revisit the default value later on if we have more evidence that this will 
help.





Issue Time Tracking
---

Worklog Id: (was: 761778)
Time Spent: 2.5h  (was: 2h 20m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlass

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761754&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761754
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 12:30
Start Date: 25/Apr/22 12:30
Worklog Time Spent: 10m 
  Work Description: asolimando commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857573180


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##
@@ -143,28 +138,82 @@ private ImmutableList 
getValidPreds(RelOptCluster cluster, RelNode chil
   }
 }
 
-// We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// We need to filter:
+//  i) those that have been pushed already as stored in the join,
+//  ii) those that were already in the subtree rooted at child.
+List toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, valids);
+
+// If we run the rule in conservative mode, we also filter:
+//  iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.
+// An unsafe predicate in this context is one that when pushed across join 
operands, can lead
+// to redundant predicates that cannot be simplified (by means of 
predicates merging with other existing ones).
+// This situation can lead to an OOM for cases where lack of 
simplification allows inferring new predicates
+// (from LHS to RHS and vice-versa) recursively, predicates which are 
redundant, but that RexSimplify cannot handle.
+// This notion can be relaxed as soon as RexSimplify gets more powerful, 
and it can handle such cases.
+if (HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE)) {
+  toPush = toPush.stream()
+  .filter(unsafeOperatorsFinder::isSafe)
+  .collect(Collectors.toList());
+}
+
+return ImmutableList.copyOf(toPush);
   }
 
-  private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex, 
RelDataType rType) {
-RexNode typeSafeRex = rex;
-if ((typeSafeRex instanceof RexCall) && 
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
-  RexBuilder rb = cluster.getRexBuilder();
-  List fixedPredElems = new ArrayList();
-  RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
-  RexUtil.types(((RexCall) rex).getOperands()));
-  for (RexNode rn : ((RexCall) rex).getOperands()) {
-fixedPredElems.add(rb.ensureType(commonType, rn, true));
-  }
+  //~ Inner Classes --
+
+  /**
+   * Finds unsafe operators in an expression (at any level of nesting).
+   * At the moment, the only unsafe operator is OR.
+   *
+   * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level 
expression)
+   * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+   *   this is equivalent to OR((<>($0, $1), IS NULL($2))
+   * Example 3: AND(OR(=($0, $1), IS NOT NULL($2 (OR in inner expression)
+   */
+  private static class UnsafeOperatorsFinder extends RexVisitorImpl {

Review Comment:
   Agreed, it's better to start with what we have right now, we can always make 
it more generic if needed later on.





Issue Time Tracking
---

Worklog Id: (was: 761754)
Time Spent: 2h 20m  (was: 2h 10m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the fi

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761749&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761749
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 12:28
Start Date: 25/Apr/22 12:28
Worklog Time Spent: 10m 
  Work Description: asolimando commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857572044


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##
@@ -65,21 +66,15 @@
  */
 public class HiveJoinPushTransitivePredicatesRule extends RelOptRule {
 
-  public static final HiveJoinPushTransitivePredicatesRule INSTANCE_JOIN =
-  new HiveJoinPushTransitivePredicatesRule(HiveJoin.class, 
HiveRelFactories.HIVE_FILTER_FACTORY);
-
-  public static final HiveJoinPushTransitivePredicatesRule INSTANCE_SEMIJOIN =
-  new HiveJoinPushTransitivePredicatesRule(HiveSemiJoin.class, 
HiveRelFactories.HIVE_FILTER_FACTORY);
-
-  public static final HiveJoinPushTransitivePredicatesRule INSTANCE_ANTIJOIN =
-  new HiveJoinPushTransitivePredicatesRule(HiveAntiJoin.class, 
HiveRelFactories.HIVE_FILTER_FACTORY);
-
+  private final HiveConf conf;

Review Comment:
   This is a great idea, I am doing that right away.





Issue Time Tracking
---

Worklog Id: (was: 761749)
Time Spent: 2h 10m  (was: 2h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-04-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=761726&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-761726
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Apr/22 11:40
Start Date: 25/Apr/22 11:40
Worklog Time Spent: 10m 
  Work Description: zabetak commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857510400


##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##
@@ -145,19 +138,25 @@ private ImmutableList getValidPreds(RelNode 
child, Set predicat
   }
 }
 
-// We need to filter i) those that have been pushed already as stored in 
the join,
-// ii) those that were already in the subtree rooted at child,
-// iii) predicates that are not safe for transitive inference.
+// We need to filter:
+//  i) those that have been pushed already as stored in the join,
+//  ii) those that were already in the subtree rooted at child.
+List toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, valids);
+
+// If we run the rule in conservative mode, we also filter:
+//  iii) predicates that are not safe for transitive inference.
 //
 // There is no formal definition of safety for predicate inference, only 
an empirical one.
 // An unsafe predicate in this context is one that when pushed across join 
operands, can lead
 // to redundant predicates that cannot be simplified (by means of 
predicates merging with other existing ones).
 // This situation can lead to an OOM for cases where lack of 
simplification allows inferring new predicates
-// (from LHS to RHS) recursively, predicates which are redundant, but that 
RexSimplify cannot handle.
+// (from LHS to RHS and vice-versa) recursively, predicates which are 
redundant, but that RexSimplify cannot handle.
 // This notion can be relaxed as soon as RexSimplify gets more powerful, 
and it can handle such cases.
-List toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, 
valids).stream()
-.filter(unsafeOperatorsFinder::isSafe)
-.collect(Collectors.toList());
+if (HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE)) {

Review Comment:
   Good idea to add a configuration flag for this change.



##
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##
@@ -143,28 +138,82 @@ private ImmutableList 
getValidPreds(RelOptCluster cluster, RelNode chil
   }
 }
 
-// We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// We need to filter:
+//  i) those that have been pushed already as stored in the join,
+//  ii) those that were already in the subtree rooted at child.
+List toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, valids);
+
+// If we run the rule in conservative mode, we also filter:
+//  iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.
+// An unsafe predicate in this context is one that when pushed across join 
operands, can lead
+// to redundant predicates that cannot be simplified (by means of 
predicates merging with other existing ones).
+// This situation can lead to an OOM for cases where lack of 
simplification allows inferring new predicates
+// (from LHS to RHS and vice-versa) recursively, predicates which are 
redundant, but that RexSimplify cannot handle.
+// This notion can be relaxed as soon as RexSimplify gets more powerful, 
and it can handle such cases.
+if (HiveConf.getBoolVar(conf, 
HiveConf.ConfVars.HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE)) {
+  toPush = toPush.stream()
+  .filter(unsafeOperatorsFinder::isSafe)
+  .collect(Collectors.toList());
+}
+
+return ImmutableList.copyOf(toPush);
   }
 
-  private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex, 
RelDataType rType) {
-RexNode typeSafeRex = rex;
-if ((typeSafeRex instanceof RexCall) && 
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
-  RexBuilder rb = cluster.getRexBuilder();
-  List fixedPredElems = new ArrayList();
-  RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
-  RexUtil.types(((RexCall) rex).getOperands()));
-  for (RexNode rn : ((RexCall) rex).getOperands()) {
-fixedPredElems.add(rb.e

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-03-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=747695&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-747695
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 25/Mar/22 09:51
Start Date: 25/Mar/22 09:51
Worklog Time Spent: 10m 
  Work Description: asolimando commented on pull request #2966:
URL: https://github.com/apache/hive/pull/2966#issuecomment-1078841442


   Gentle reminder for the review @zabetak :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 747695)
Time Spent: 1h 50m  (was: 1h 40m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=742896&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-742896
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 17/Mar/22 00:16
Start Date: 17/Mar/22 00:16
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #2840:
URL: https://github.com/apache/hive/pull/2840


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 742896)
Time Spent: 1h 40m  (was: 1.5h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=738453&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-738453
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 09/Mar/22 00:16
Start Date: 09/Mar/22 00:16
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #2840:
URL: https://github.com/apache/hive/pull/2840#issuecomment-1062419396


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 738453)
Time Spent: 1.5h  (was: 1h 20m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-02-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=725071&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-725071
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 11/Feb/22 11:30
Start Date: 11/Feb/22 11:30
Worklog Time Spent: 10m 
  Work Description: asolimando commented on pull request #2966:
URL: https://github.com/apache/hive/pull/2966#issuecomment-1036109130


   @zabetak, can I ask you to have another look when you have time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 725071)
Time Spent: 1h 20m  (was: 1h 10m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-02-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=720033&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-720033
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 03/Feb/22 11:16
Start Date: 03/Feb/22 11:16
Worklog Time Spent: 10m 
  Work Description: asolimando commented on a change in pull request #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r798459592



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
 }
 
 // We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// ii) those that were already in the subtree rooted at child,
+// iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.
+// An unsafe predicate in this context is one that when pushed across join 
operands, can lead
+// to redundant predicates that cannot be simplified (by means of 
predicates merging with other existing ones).
+// This situation can lead to an OOM for cases where lack of 
simplification allows inferring new predicates
+// (from LHS to RHS) recursively, predicates which are redundant, but that 
RexSimplify cannot handle.
+// This notion can be relaxed as soon as RexSimplify gets more powerful, 
and it can handle such cases.
+List toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, 
valids).stream()
+.filter(UNSAFE_OPERATORS_FINDER::isSafe)
+.collect(Collectors.toList());
+
+return ImmutableList.copyOf(toPush);
   }
 
-  private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex, 
RelDataType rType) {
-RexNode typeSafeRex = rex;
-if ((typeSafeRex instanceof RexCall) && 
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
-  RexBuilder rb = cluster.getRexBuilder();
-  List fixedPredElems = new ArrayList();
-  RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
-  RexUtil.types(((RexCall) rex).getOperands()));
-  for (RexNode rn : ((RexCall) rex).getOperands()) {
-fixedPredElems.add(rb.ensureType(commonType, rn, true));
-  }
+  //~ Inner Classes --
+
+  /**
+   * Finds unsafe operators in an expression (at any level of nesting).
+   */
+  private static class UnsafeOperatorsFinder extends RexVisitorImpl {
+// accounting for DeMorgan's law
+boolean inNegation = false;
 
-  typeSafeRex = rb.makeCall(((RexCall) typeSafeRex).getOperator(), 
fixedPredElems);
+protected UnsafeOperatorsFinder(boolean deep) {
+  super(deep);
 }
 
-return typeSafeRex;
+@Override
+public Void visitCall(RexCall call) {
+  switch (call.getKind()) {
+  case OR:
+if (inNegation) {
+  return super.visitCall(call);
+} else {
+  throw Util.FoundOne.NULL;

Review comment:
   I know it's an anti-pattern but it's anywhere in the codebase (even in 
Calcite IIRC I have seen it), but it's better not to add another instance of 
it, I have updated the code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 720033)
Time Spent: 1h 10m  (was: 1h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS mon

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-02-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=720032&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-720032
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 03/Feb/22 11:15
Start Date: 03/Feb/22 11:15
Worklog Time Spent: 10m 
  Work Description: asolimando commented on a change in pull request #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r798459592



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
 }
 
 // We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// ii) those that were already in the subtree rooted at child,
+// iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.
+// An unsafe predicate in this context is one that when pushed across join 
operands, can lead
+// to redundant predicates that cannot be simplified (by means of 
predicates merging with other existing ones).
+// This situation can lead to an OOM for cases where lack of 
simplification allows inferring new predicates
+// (from LHS to RHS) recursively, predicates which are redundant, but that 
RexSimplify cannot handle.
+// This notion can be relaxed as soon as RexSimplify gets more powerful, 
and it can handle such cases.
+List toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, 
valids).stream()
+.filter(UNSAFE_OPERATORS_FINDER::isSafe)
+.collect(Collectors.toList());
+
+return ImmutableList.copyOf(toPush);
   }
 
-  private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex, 
RelDataType rType) {
-RexNode typeSafeRex = rex;
-if ((typeSafeRex instanceof RexCall) && 
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
-  RexBuilder rb = cluster.getRexBuilder();
-  List fixedPredElems = new ArrayList();
-  RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
-  RexUtil.types(((RexCall) rex).getOperands()));
-  for (RexNode rn : ((RexCall) rex).getOperands()) {
-fixedPredElems.add(rb.ensureType(commonType, rn, true));
-  }
+  //~ Inner Classes --
+
+  /**
+   * Finds unsafe operators in an expression (at any level of nesting).
+   */
+  private static class UnsafeOperatorsFinder extends RexVisitorImpl {
+// accounting for DeMorgan's law
+boolean inNegation = false;
 
-  typeSafeRex = rb.makeCall(((RexCall) typeSafeRex).getOperator(), 
fixedPredElems);
+protected UnsafeOperatorsFinder(boolean deep) {
+  super(deep);
 }
 
-return typeSafeRex;
+@Override
+public Void visitCall(RexCall call) {
+  switch (call.getKind()) {
+  case OR:
+if (inNegation) {
+  return super.visitCall(call);
+} else {
+  throw Util.FoundOne.NULL;

Review comment:
   I know it's an anti-pattern but it's anywhere in the codebase (even in 
Calcite IIRC I have seen it), but it's better not to add another instance of it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 720032)
Time Spent: 1h  (was: 50m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNI

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-02-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=719996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-719996
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 03/Feb/22 10:17
Start Date: 03/Feb/22 10:17
Worklog Time Spent: 10m 
  Work Description: asolimando commented on a change in pull request #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r798411704



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
 }
 
 // We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// ii) those that were already in the subtree rooted at child,
+// iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.

Review comment:
   The change does not reinforce the dependency on the registry. If you 
check the call, we are filtering candidate predicates coming from 
`HiveRelMdPredicates`, which is not linked to the registry at all. The 
predicates in the registry are used to add another filtering step (to remove 
"duplicates"), but the two steps are totally independent, they are just 
performed within the same method.
   
   This said, I agree that keeping a global state is a problem, and should be 
avoided. The current implementation is also faulty under some respects.
   
   The motivation for storing predicates in the registry is given in 
[HIVE-12478](https://issues.apache.org/jira/browse/HIVE-12478?focusedCommentId=15047639&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15047639).
   
   Since predicates can be modified (pushed past project/setOp/joins/aggregate, 
they can be simplified or merged), we can try to push them transitively as it 
happens in this OOM cases.
   
   But, as I was saying, the current implementation is faulty, since even the 
registry is not enough, because most of the other rules they don't know about 
it, they will alter the predicates, but they won't register them into the 
registry. Making all those rules aware of the registry would increase the 
coupling, hence not an option IMO.
   
   The registry main use was to prevent firing the same rule against the same 
rel multiple times, but even this use is not complete, since rels are modified 
along the way, and we don't update the registry accordingly. For instance, a 
join condition is updated, the new join has a fresh rel-id, the registry will 
try to re-apply `HiveJoinPushTransitivePredicatesRule`, and if the predicates 
were modified too, this will also escape the duplicate check based on the 
`registry.getPushedPredicates(join)`, and we can have a loop).
   
   On the other hand, I don't see how we can prevent such loops without storing 
a global state. But for this global state to work, each and every rule must be 
aware of it, which is totally not doable IMO, which is also the line in Calcite.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 719996)
Time Spent: 50m  (was: 40m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are gener

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-02-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=719995&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-719995
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 03/Feb/22 10:16
Start Date: 03/Feb/22 10:16
Worklog Time Spent: 10m 
  Work Description: asolimando commented on a change in pull request #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r798411704



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
 }
 
 // We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// ii) those that were already in the subtree rooted at child,
+// iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.

Review comment:
   The change does not reinforce the dependency on the registry. If you 
check the call, we are filtering candidate predicates coming from 
`HiveRelMdPredicates`, which is not linked to the registry at all. The 
predicates in the registry are used to add another filtering step (to remove 
"duplicates"), but the two steps are totally independent.
   
   This said, I agree that keeping a global state is a problem, and should be 
avoided. The current implementation is also faulty under some respects.
   
   The motivation for storing predicates in the registry is given in 
[HIVE-12478](https://issues.apache.org/jira/browse/HIVE-12478?focusedCommentId=15047639&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15047639).
   
   Since predicates can be modified (pushed past project/setOp/joins/aggregate, 
they can be simplified or merged), we can try to push them transitively as it 
happens in this OOM cases.
   
   But, as I was saying, the current implementation is faulty, since even the 
registry is not enough, because most of the other rules they don't know about 
it, they will alter the predicates, but they won't register them into the 
registry. Making all those rules aware of the registry would increase the 
coupling, hence not an option IMO.
   
   The registry main use was to prevent firing the same rule against the same 
rel multiple times, but even this use is not complete, since rels are modified 
along the way, and we don't update the registry accordingly. For instance, a 
join condition is updated, the new join has a fresh rel-id, the registry will 
try to re-apply `HiveJoinPushTransitivePredicatesRule`, and if the predicates 
were modified too, this will also escape the duplicate check based on the 
`registry.getPushedPredicates(join)`, and we can have a loop).
   
   On the other hand, I don't see how we can prevent such loops without storing 
a global state. But for this global state to work, each and every rule must be 
aware of it, which is totally not doable IMO, which is also the line in Calcite.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 719995)
Time Spent: 40m  (was: 0.5h)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-02-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=719958&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-719958
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 03/Feb/22 09:03
Start Date: 03/Feb/22 09:03
Worklog Time Spent: 10m 
  Work Description: asolimando commented on a change in pull request #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r798348894



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveInBetweenExpandRule.java
##
@@ -138,7 +138,7 @@ public void onMatch(RelOptRuleCall call) {
 private final RexBuilder rexBuilder;
 private boolean modified;
 
-private RexInBetweenExpander(RexBuilder rexBuilder) {
+RexInBetweenExpander(RexBuilder rexBuilder) {

Review comment:
   TLDR: Sorry, that's just a leftover from a previous attempt.
   
   Long story:
   At some point I attempted to expand `IN` to check the impact on 
simplification, but there were more side-effects than benefits, I have dropped 
that part but forgot to put the original visibility back.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 719958)
Time Spent: 0.5h  (was: 20m)

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2022-02-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=718667&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-718667
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 01/Feb/22 13:28
Start Date: 01/Feb/22 13:28
Worklog Time Spent: 10m 
  Work Description: zabetak commented on a change in pull request #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r796565785



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
 }
 
 // We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// ii) those that were already in the subtree rooted at child,
+// iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.
+// An unsafe predicate in this context is one that when pushed across join 
operands, can lead
+// to redundant predicates that cannot be simplified (by means of 
predicates merging with other existing ones).
+// This situation can lead to an OOM for cases where lack of 
simplification allows inferring new predicates
+// (from LHS to RHS) recursively, predicates which are redundant, but that 
RexSimplify cannot handle.
+// This notion can be relaxed as soon as RexSimplify gets more powerful, 
and it can handle such cases.
+List toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, 
valids).stream()
+.filter(UNSAFE_OPERATORS_FINDER::isSafe)
+.collect(Collectors.toList());
+
+return ImmutableList.copyOf(toPush);
   }
 
-  private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex, 
RelDataType rType) {
-RexNode typeSafeRex = rex;
-if ((typeSafeRex instanceof RexCall) && 
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
-  RexBuilder rb = cluster.getRexBuilder();
-  List fixedPredElems = new ArrayList();
-  RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
-  RexUtil.types(((RexCall) rex).getOperands()));
-  for (RexNode rn : ((RexCall) rex).getOperands()) {
-fixedPredElems.add(rb.ensureType(commonType, rn, true));
-  }
+  //~ Inner Classes --
+
+  /**
+   * Finds unsafe operators in an expression (at any level of nesting).

Review comment:
   Add few examples with "unsafe" expressions.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
 }
 
 // We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree rooted at child
-ImmutableList toPush = 
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
-child, valids);
-return toPush;
+// ii) those that were already in the subtree rooted at child,
+// iii) predicates that are not safe for transitive inference.
+//
+// There is no formal definition of safety for predicate inference, only 
an empirical one.

Review comment:
   The existence of the registry is an element that makes the behavior of 
this rule strongly dependent on the behavior of other rules. This change adds 
an additional dependency on the behavior of other rules & components which is 
not great. We can leave with it for some time but not sure if we can say it is 
a long term solution.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -74,6 +74,8 @@
   public static final HiveJoinPushTransitivePredicatesRule INSTANCE_ANTIJOIN =
   new HiveJoinPushTransitivePredicatesRule(HiveAntiJoin.class, 
HiveRelFactories.HIVE_FILTER_FACTORY);
 
+  private static final UnsafeOperatorsFinder UNSAFE_OPERATORS_FINDER = new 
UnsafeOperatorsFinder(true);

Review comment:
   I think we may end-up calling this visitor from multiple threads so I 
would either remove static or make the class immutable.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java
##
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
 }
 
 // We need to filter i) those that have been pushed already as stored in 
the join,
-// and ii) those that were already in the subtree roo

[jira] [Work logged] (HIVE-25758) OOM due to recursive application of CBO rules

2021-12-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25758?focusedWorklogId=689644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-689644
 ]

ASF GitHub Bot logged work on HIVE-25758:
-

Author: ASF GitHub Bot
Created on: 02/Dec/21 22:30
Start Date: 02/Dec/21 22:30
Worklog Time Spent: 10m 
  Work Description: asolimando opened a new pull request #2840:
URL: https://github.com/apache/hive/pull/2840


   
   ### What changes were proposed in this pull request?
   
   
   Removes OOM in query planning by adding simplification of new predicates 
created by `HiveJoinPushTransitivePredicatesRule`
   
   ### Why are the changes needed?
   
   
   Query in the JIRA ticket was failing with OOM.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No (apart from affected queries, now succeeding. 
   
   ### How was this patch tested?
   
   
   `mvn test -Dtest=TestMiniLlapLocalCliDriver 
-Dqfile="cbo_join_push_transitive_pred_oom.q" -pl itests/qtest -Pitests`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 689644)
Remaining Estimate: 0h
Time Spent: 10m

> OOM due to recursive application of CBO rules
> -
>
> Key: HIVE-25758
> URL: https://issues.apache.org/jira/browse/HIVE-25758
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, Query Planning
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
>  
> Reproducing query is as follows:
> {code:java}
> create table test1 (act_nbr string);
> create table test2 (month int);
> create table test3 (mth int, con_usd double);
> EXPLAIN
>SELECT c.month,
>   d.con_usd
>FROM
>  (SELECT 
> cast(regexp_replace(substr(add_months(from_unixtime(unix_timestamp(), 
> '-MM-dd'), -1), 1, 7), '-', '') AS int) AS month
>   FROM test1
>   UNION ALL
>   SELECT month
>   FROM test2
>   WHERE month = 202110) c
>JOIN test3 d ON c.month = d.mth; {code}
>  
> Different plans are generated during the first CBO steps, last being:
> {noformat}
> 2021-12-01T08:28:08,598 DEBUG [a18191bb-3a2b-4193-9abf-4e37dd1996bb main] 
> parse.CalcitePlanner: Plan after decorre
> lation:
> HiveProject(month=[$0], con_usd=[$2])
>   HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>     HiveProject(month=[$0])
>       HiveUnion(all=[true])
>         
> HiveProject(month=[CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-d
> d':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 7), 
> _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-
> 16LE", _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER])
>           HiveTableScan(table=[[default, test1]], table:alias=[test1])
>         HiveProject(month=[$0])
>           HiveFilter(condition=[=($0, CAST(202110):INTEGER)])
>             HiveTableScan(table=[[default, test2]], table:alias=[test2])
>     HiveTableScan(table=[[default, test3]], table:alias=[d]){noformat}
>  
> Then, the HEP planner will keep expanding the filter expression with 
> redundant expressions, such as the following, where the identical CAST 
> expression is present multiple times:
>  
> {noformat}
> rel#118:HiveFilter.HIVE.[].any(input=HepRelVertex#39,condition=IN(CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP,
>  _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> CAST(regexp_replace(substr(add_months(FROM_UNIXTIME(UNIX_TIMESTAMP, 
> _UTF-16LE'-MM-dd':VARCHAR(2147483647) CHARACTER SET "UTF-16LE"), -1), 1, 
> 7), _UTF-16LE'-':VARCHAR(2147483647) CHARACTER SET "UTF-16LE", 
> _UTF-16LE'':VARCHAR(2147483647) CHARACTER SET "UTF-16LE")):INTEGER, 
> 202110)){noformat}
>  
> The problem seems to come from a bad interaction of at least 
> _HiveFilterProjectTransposeRule_ and 
> {_}HiveJoinPushTransitivePredicatesRule{_}, possibly more.
> Most probably then UNION part can be removed and the reproducer be simplified 
> even further.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)