[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461139
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 20/Jul/20 15:51
Start Date: 20/Jul/20 15:51
Worklog Time Spent: 10m 
  Work Description: ramesh0201 commented on pull request #1147:
URL: https://github.com/apache/hive/pull/1147#issuecomment-661124391


   Runtime changes look good to me +1.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461139)
Time Spent: 1h 10m  (was: 1h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently, Hive does not support anti join. A query that needs an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from the 
> right side, and an additional filter is needed to remove the redundant rows. This 
> can be avoided with an anti join, which projects only the required columns and 
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate join keys can be dropped at the 
> child node instead of being moved to the join node. This can save a significant 
> amount of data movement when the number of distinct join keys is small relative 
> to the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, 
> because only the key is needed to check whether a record matches the join 
> condition. For a left join, the non-key columns are needed as well, so a hash 
> table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in the web_sales table in a typical 
> 10TB TPC-DS setup is just 10% of the total records. So when this query is 
> converted to an anti join, only 600 million rows are moved to the join node 
> instead of 7 billion.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes care of 
> subqueries with a "not exists" clause, which are first converted to 
> filter + left-join and then to anti join. Queries with "not in" are not handled 
> in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461160&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461160
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 20/Jul/20 16:37
Start Date: 20/Jul/20 16:37
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r457545081



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new 
HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+"HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+final Project project = call.rel(0);
+final Filter filter = call.rel(1);
+final Join join = call.rel(2);
+perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, 
Join join) {
+LOG.debug("Matched HiveAntiJoinRule");

Review comment:
   Sure, will do that.
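
For readers skimming the thread: the rule quoted above matches a Project over a Filter over a Join and, per the plan comment in the file, rewrites a left outer join followed by an IS NULL filter on the right-side key into an anti join (a HiveAntiJoin in the resulting plan). The snippet below is only an illustrative sketch of that kind of rewrite using Calcite's RelBuilder, which is an assumption; the actual Hive rule performs additional validity checks, for example that the Project references left-side columns only and that the IS NULL predicate can be dropped:

{code:java}
// Hypothetical sketch, not the body of HiveJoinWithFilterToAntiJoinRule.
// Assumes org.apache.calcite.tools.RelBuilder in addition to the imports shown above.
protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
  if (join.getJoinType() != JoinRelType.LEFT) {
    return;  // only a left outer join followed by an IS NULL filter qualifies
  }
  RelBuilder builder = call.builder();
  builder.push(join.getLeft())
      .push(join.getRight())
      .antiJoin(join.getCondition())     // same join condition, anti semantics
      .project(project.getProjects());   // assumed to reference left-side columns only
  call.transformTo(builder.build());
}
{code}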





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461160)
Time Spent: 1h 20m  (was: 1h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461406&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461406
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 04:32
Start Date: 21/Jul/20 04:32
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r457829820



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
 "Whether Hive enables the optimization about converting common join 
into mapjoin based on the input file size. \n" +
 "If this parameter is on, and the sum of size for n-1 of the 
tables/partitions for a n-way join is smaller than the\n" +
 "specified size, the join is directly converted to a mapjoin (there is 
no conditional task)."),
-
+HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment:
   Is there any reason why we should not enable this by default in master? 
It seems it is always beneficial to execute the antijoin since we already have 
a vectorized implementation too. That would increase the test coverage for the 
feature.
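
For context, the flag under discussion is the one added in the hunk above, hive.auto.convert.anti.join, which defaults to false in the patch. A minimal sketch of toggling it programmatically through HiveConf; the enum constant HIVE_CONVERT_ANTI_JOIN comes from the diff, while the surrounding class is made up for illustration:

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;

public class AntiJoinFlagExample {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Opt in to the left-join + IS NULL -> anti join conversion.
    conf.setBoolVar(HiveConf.ConfVars.HIVE_CONVERT_ANTI_JOIN, true);
    System.out.println("hive.auto.convert.anti.join = "
        + conf.getBoolVar(HiveConf.ConfVars.HIVE_CONVERT_ANTI_JOIN));
  }
}
{code}

In a session the equivalent would simply be setting hive.auto.convert.anti.join=true.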





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461406)
Time Spent: 1.5h  (was: 1h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461564&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461564
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 13:09
Start Date: 21/Jul/20 13:09
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458082243



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
 "Whether Hive enables the optimization about converting common join 
into mapjoin based on the input file size. \n" +
 "If this parameter is on, and the sum of size for n-1 of the 
tables/partitions for a n-way join is smaller than the\n" +
 "specified size, the join is directly converted to a mapjoin (there is 
no conditional task)."),
-
+HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment:
   Agree with the above, I believe we should enable anti-join by default as 
1) this feature should always improve runtime, 2) it can help us find possible 
issues, and 3) we can further optimize the existing implementation based on future scenarios





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461564)
Time Spent: 1h 50m  (was: 1h 40m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461562&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461562
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 13:09
Start Date: 21/Jul/20 13:09
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458082243



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
 "Whether Hive enables the optimization about converting common join 
into mapjoin based on the input file size. \n" +
 "If this parameter is on, and the sum of size for n-1 of the 
tables/partitions for a n-way join is smaller than the\n" +
 "specified size, the join is directly converted to a mapjoin (there is 
no conditional task)."),
-
+HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment:
   Agree with the above, I believe we should enable anti-join by default as 
1) this feature should always improve runtime, 2) it can help us find possible 
issues and further optimize the existing implementation





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461562)
Time Spent: 1h 40m  (was: 1.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461604&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461604
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:22
Start Date: 21/Jul/20 14:22
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458135999



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -638,6 +657,12 @@ private void genObject(int aliasNum, boolean allLeftFirst, 
boolean allLeftNull)
   // skipping the rest of the rows in the rhs table of the semijoin
   done = !needsPostEvaluation;
 }
+  } else if (type == JoinDesc.ANTI_JOIN) {
+if (innerJoin(skip, left, right)) {
+  // if anti join found a match then the condition is not matched for 
anti join, so we can skip rest of the

Review comment:
   nit: if inner join found a match.
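
The nit aside, the comment captures the key property being implemented: for an anti join, the first right-side match disqualifies the left row, so the remaining right-side rows can be skipped. A stand-alone illustration of that existence test (conceptual only, made-up names, not the CommonJoinOperator code):

{code:java}
// Conceptual illustration of anti-join existence testing.
public final class AntiJoinSemantics {
  static <K> boolean antiJoinKeeps(K leftKey, Iterable<K> rightKeys) {
    for (K rightKey : rightKeys) {
      if (leftKey.equals(rightKey)) {
        return false;   // first match disqualifies the left row and ends the scan
      }
    }
    return true;        // no right-side match: the left row is emitted
  }
}
{code}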





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461604)
Time Spent: 2h  (was: 1h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461613&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461613
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:31
Start Date: 21/Jul/20 14:31
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458143378



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) 
throws HiveException {
 forward = true;
   }
 }
+return forward;
+  }
+
+  // returns whether a record was forwarded
+  private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) 
throws HiveException {
+boolean forward = fillFwdCache(skip);
 if (forward) {
   if (needsPostEvaluation) {
 forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, 
residualJoinFiltersOIs);
   }
-  if (forward) {
+
+  // For anti join, check all right side and if nothing is matched then 
only forward.

Review comment:
   Not sure I fully understand the comment here -- !forward (false) and 
antijoin (true) will still skip the object
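
One way to read the code in question: for the other join types a row can be forwarded as soon as the join produces it (subject to residual filters), whereas for an anti join the decision is only known after the whole right-side group has been examined, and a single match means the row is never forwarded. A simplified decision helper (hypothetical, not the operator logic itself):

{code:java}
// Simplified sketch of the forwarding decision; names are made up.
static boolean shouldForward(boolean rowProduced, boolean isAntiJoin, boolean anyRightMatch) {
  if (isAntiJoin) {
    return !anyRightMatch;   // forward only when no right-side row matched at all
  }
  return rowProduced;        // other join types forward each produced row
}
{code}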





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461613)
Time Spent: 2h 10m  (was: 2h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461615&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461615
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:37
Start Date: 21/Jul/20 14:37
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458147849



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java
##
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to 
merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti 
joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key 
exists
+ * in the small table.
+ *
+ * No small table values are needed for anti since they would be empty.  So,
+ * we use a hash set as the hash table.  Hash sets just report whether a key 
exists.  This
+ * is a big performance optimization.
+ */
+public abstract class VectorMapJoinAntiJoinGenerateResultOperator
+extends VectorMapJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = 
LoggerFactory.getLogger(VectorMapJoinAntiJoinGenerateResultOperator.class.getName());
+
+  // Anti join specific members.
+
+  // An array of hash set results so we can do lookups on the whole batch 
before output result
+  // generation.
+  protected transient VectorMapJoinHashSetResult hashSetResults[];
+
+  // Pre-allocated member for storing the (physical) batch index of matching 
row (single- or
+  // multi-small-table-valued) indexes during a process call.
+  protected transient int[] allMatchs;
+
+  // Pre-allocated member for storing the (physical) batch index of rows that 
need to be spilled.
+  protected transient int[] spills;
+
+  // Pre-allocated member for storing index into the hashSetResults for each 
spilled row.
+  protected transient int[] spillHashMapResultIndices;
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinGenerateResultOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx) 
{
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+ VectorizationContext 
vContext, VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  /*
+   * Setup our anti join specific members.
+   */
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Anti join specific.
+VectorMapJoinHashSet baseHashSet = (VectorMapJoinHashSet) 
vectorMapJoinHashTable;
+
+hashSetResults = new 
VectorMapJoinHashSetResult[VectorizedRowBatch.DEFAULT_SIZE];
+for (int i = 0; i < hashSetRe

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461616&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461616
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:42
Start Date: 21/Jul/20 14:42
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458151829



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java
##
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to 
merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti 
joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key 
exists

Review comment:
   nit: whose key DOES NOT exist
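
As the class comment (and the nit) indicate, the map-side anti join only needs existence testing, which is why a hash set of small-table keys suffices where other map joins need a hash map of values. A plain-Java, non-vectorized sketch of the idea (made-up names, not the VectorMapJoin operator itself):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Keep only the big-table rows whose join key is absent from the small table.
public final class HashSetAntiJoinSketch {
  static List<long[]> antiJoin(List<long[]> bigTableRows, int keyColumn, Set<Long> smallTableKeys) {
    List<long[]> survivors = new ArrayList<>();
    for (long[] row : bigTableRows) {
      if (!smallTableKeys.contains(row[keyColumn])) {   // existence test only, no values needed
        survivors.add(row);
      }
    }
    return survivors;
  }
}
{code}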





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461616)
Time Spent: 2.5h  (was: 2h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461619&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461619
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:49
Start Date: 21/Jul/20 14:49
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458157216



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)

Review comment:
   leftover?
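
Besides the leftover comment, the quoted operator carries optional min/max bounds for the long hash set (the useMinMax, min and max members above). For an anti join they act as a cheap pre-filter: a key outside the small table's [min, max] range cannot match, so the row survives without probing the hash set at all. A simplified, scalar sketch reusing those member names (the real operator works on whole VectorizedRowBatches):

{code:java}
// Sketch only, not the vectorized operator code.
boolean keepRow(long key, boolean useMinMax, long min, long max, java.util.Set<Long> smallTableKeys) {
  if (useMinMax && (key < min || key > max)) {
    return true;                         // out of range: definitely unmatched, keep for anti join
  }
  return !smallTableKeys.contains(key);  // otherwise probe the hash set
}
{code}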





This is an automated message from the Apache Git Service.
To respond to the message, please log

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461620
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:51
Start Date: 21/Jul/20 14:51
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458158828



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  }
+

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461622
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:52
Start Date: 21/Jul/20 14:52
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458158828



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461626&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461626
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:56
Start Date: 21/Jul/20 14:56
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458162559



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate code needs to be merged with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash set for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, OperatorDesc conf,
+      VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    // Initialize Single-Column Long members for this specialized class.
+    singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+    super.hashTableSetup();
+
+    // Get our Single-Column Long hash set information for this specialized class.
+    hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+    useMinMax = hashSet.useMinMax();
+    if (useMinMax) {
+      min = hashSet.min();
+      max = hashSet.max();
+    }
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+    try {
+      // (Currently none)
+      // antiPerBatchSetup(batch);
+
+      // For anti joins, we may apply the filter(s) now.
+      for (VectorExpression ve : bigTableFilterExpressions) {
+        ve.evaluate(batch);
+      }
+
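To make the probe logic above easier to follow: the small (right) table has already been loaded into a long-keyed hash set, optionally exposing a min/max key range, and a big-table row survives the anti join only when its key misses that set. Because only key membership is tested, a set such as VectorMapJoinLongHashSet is enough; no values need to be stored. A minimal standalone sketch of that idea, using a plain java.util.HashSet, where all class and method names are illustrative only and not Hive APIs:

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class AntiJoinLongProbeSketch {

  // Hypothetical stand-in for the right-side hash set: only keys are stored,
  // which is why a set (not a key/value hash table) is sufficient for anti join.
  static final class LongKeySet {
    private final Set<Long> keys = new HashSet<>();
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    void add(long key) {
      keys.add(key);
      min = Math.min(min, key);
      max = Math.max(max, key);
    }

    boolean useMinMax() { return !keys.isEmpty(); }
    long min() { return min; }
    long max() { return max; }
    boolean contains(long key) { return keys.contains(key); }
  }

  // Anti-join probe: keep a left key only if it cannot match the right side.
  // The min/max test mirrors the optional integer range filtering in the operator above.
  static List<Long> antiJoinProbe(List<Long> leftKeys, LongKeySet rightSet) {
    return leftKeys.stream()
        .filter(k -> {
          if (rightSet.useMinMax() && (k < rightSet.min() || k > rightSet.max())) {
            return true;  // outside the right side's key range: definitely no match
          }
          return !rightSet.contains(k);  // keep only keys absent from the right side
        })
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    LongKeySet right = new LongKeySet();
    for (long k : new long[] {10, 11, 12}) {
      right.add(k);
    }
    List<Long> left = Arrays.asList(5L, 10L, 11L, 42L);
    // 5 is below min and 42 is above max, so both are kept; 10 and 11 match and are dropped.
    System.out.println(antiJoinProbe(left, right));  // [5, 42]
  }
}
{code}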

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461627&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461627
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:57
Start Date: 21/Jul/20 14:57
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458162559



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461629&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461629
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:00
Start Date: 21/Jul/20 15:00
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458165844



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461632
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:02
Start Date: 21/Jul/20 15:02
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458167515



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461630&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461630
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:02
Start Date: 21/Jul/20 15:02
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458167104



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461633&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461633
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:05
Start Date: 21/Jul/20 15:05
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458169457



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461634&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461634
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:05
Start Date: 21/Jul/20 15:05
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458169723



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinMultiKeyOperator.java
##
@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.apache.hadoop.hive.serde2.ByteStream.Output;
+import org.apache.hadoop.hive.serde2.binarysortable.fast.BinarySortableSerializeWrite;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// Multi-Key hash table import.
+// Multi-Key specific imports.
+
+// TODO : Duplicate code needs to be merged with semi join.
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on Multi-Key
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinMultiKeyOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+
+  //------------------------------------------------------------------------------------------------
+
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinMultiKeyOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  //------------------------------------------------------------------------------------------------
+
+  // (none)
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+  //------------------------------------------------------------------------------------------------
+
+  // The hash set for this specialized class.
+  private transient VectorMapJoinBytesHashSet hashSet;
+
+  //------------------------------------------------------------------------------------------------
+  // Multi-Key specific members.
+  //
+
+  // Object that can take a set of columns in a row of a vectorized row batch and serialize it.
+  // Known to not have any nulls.
+  private transient VectorSerializeRow keyVectorSerializeWrite;
+
+  // The BinarySortable serialization of the current key.
+  private transient Output currentKeyOutput;
+
+  // The BinarySortable serialization of the saved key for a possible series of equal keys.
+  private transient Output saveKeyOutput;
+
+  //------------------------------------------------------------------------------------------------
+  // Pass-thru constructors.
+  //
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinMultiKeyOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx, OperatorDesc conf,
+      VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  //------------------------------------------------------------------------------------------------
+  // Process Multi-Key Anti Join on a vectorized row batch.
+  //
+
+  @Override
+  protected void commonSetup() throws H
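The multi-key operator quoted above applies the same anti-join pattern, except that the key columns are first serialized (BinarySortable in the quoted code) into one byte string that becomes the hash-set key. A rough standalone sketch of that approach, using a simple DataOutputStream encoding in place of Hive's serializers; all names below are illustrative only, not Hive APIs:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

public class AntiJoinMultiKeySketch {

  // Hypothetical stand-in for a binary key serializer: the key columns of a row
  // are flattened into one byte sequence so a bytes-keyed hash set can be probed.
  static byte[] serializeKey(long col1, String col2) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeLong(col1);  // fixed-width part of the composite key
    out.writeUTF(col2);   // length-prefixed part, so keys cannot collide by simple concatenation
    out.flush();
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // Build the right-side key set (only keys, no values: enough for anti join).
    Set<ByteBuffer> rightKeys = new HashSet<>();
    rightKeys.add(ByteBuffer.wrap(serializeKey(1L, "a")));
    rightKeys.add(ByteBuffer.wrap(serializeKey(2L, "b")));

    // Probe left rows: a row survives the anti join only when its serialized key is absent.
    Object[][] leftRows = { {1L, "a"}, {1L, "x"}, {3L, "b"} };
    for (Object[] row : leftRows) {
      byte[] key = serializeKey((Long) row[0], (String) row[1]);
      boolean matches = rightKeys.contains(ByteBuffer.wrap(key));
      if (!matches) {
        System.out.println("anti-join keeps: " + row[0] + ", " + row[1]);
      }
    }
    // Prints rows (1, x) and (3, b); row (1, a) matches the right side and is dropped.
  }
}
{code}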

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461628&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461628
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 14:58
Start Date: 21/Jul/20 14:58
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458164592



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461636&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461636
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:17
Start Date: 21/Jul/20 15:17
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458178786



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinMultiKeyOperator.java
##
@@ -0,0 +1,400 @@

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461637&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461637
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:19
Start Date: 21/Jul/20 15:19
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458180264



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinStringOperator.java
##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.StringExpr;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// Single-Column String hash table import.
+// Single-Column String specific imports.
+
+// TODO : Duplicate code needs to be merged with semi join.
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a Single-Column String
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinStringOperator extends VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+
+  //------------------------------------------------------------------------------------------------
+
+  private static final String CLASS_NAME = VectorMapJoinAntiJoinStringOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+
+  protected String getLoggingPrefix() {
+    return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  //------------------------------------------------------------------------------------------------
+
+  // (none)
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+  //------------------------------------------------------------------------------------------------
+
+  // The hash set for this specialized class.
+  private transient VectorMapJoinBytesHashSet hashSet;
+
+  //------------------------------------------------------------------------------------------------
+  // Single-Column String specific members.
+  //
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  //------------------------------------------------------------------------------------------------
+  // Pass-thru constructors.
+  //
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinStringOperator() {
+    super();
+  }
+
+  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx) {
+    super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx, OperatorDesc conf,
+      VectorizationContext vContext, VectorDesc vectorDesc) throws HiveException {
+    super(ctx, conf, vContext, vectorDesc);
+  }
+
+  //------------------------------------------------------------------------------------------------
+  // Process Single-Column String Anti Join on a vectorized row batch.
+  //
+
+  @Override
+  protected void commonSetup() throws HiveException {
+    super.commonSetup();
+
+    /*
+     * Initialize Single-Column String members for this specialized class.
+     */
+
+    singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+    super.hashTableSetup();
+
+    /*
+     * Get our Single-Column String hash set information for this

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461643
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:33
Start Date: 21/Jul/20 15:33
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458190642



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz,
   @Override
   public void onMatch(RelOptRuleCall call) {
     Join join = call.rel(0);
-    if (join.getJoinType() == JoinRelType.FULL || join.getCondition().isAlwaysTrue()) {
+
+    // For anti join case add the not null on right side if the condition is

Review comment:
   Not sure I understand the issue here -- is the problem the fact that 
ANTI-join matches with NULL rows on the right side?
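One way to see why adding the not-null filter on the right-side join key is safe for anti join: with an equality join condition, a right-side row whose key is NULL can never equal any left-side key, so dropping such rows before the join cannot change which left rows survive. A small self-contained check of that claim on hypothetical data, using plain Java collections rather than any Hive classes:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class AntiJoinNotNullCheck {

  // A left row survives when no right key equals its key (SQL equality: NULL matches nothing).
  static List<Long> antiJoin(List<Long> leftKeys, List<Long> rightKeys) {
    List<Long> result = new ArrayList<>();
    for (Long l : leftKeys) {
      boolean matched = false;
      for (Long r : rightKeys) {
        if (l != null && r != null && l.equals(r)) {  // NULL = NULL is not a match in SQL
          matched = true;
          break;
        }
      }
      if (!matched) {
        result.add(l);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Long> left = Arrays.asList(1L, 2L, null, 4L);
    List<Long> right = Arrays.asList(2L, null, null, 5L);

    // Same right side after applying a "key IS NOT NULL" filter.
    List<Long> rightNotNull = new ArrayList<>(right);
    rightNotNull.removeIf(Objects::isNull);

    System.out.println(antiJoin(left, right));        // [1, null, 4]
    System.out.println(antiJoin(left, rightNotNull)); // [1, null, 4], the identical result
  }
}
{code}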





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461643)
Time Spent: 4h 50m  (was: 4h 40m)



[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461650&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461650
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:53
Start Date: 21/Jul/20 15:53
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458205456



##
File path: ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java
##
@@ -339,6 +339,12 @@ String getFuncText(String funcText, final int srcPos) {
       vector.add(right, left);
       break;
     case JoinDesc.LEFT_OUTER_JOIN:
+    case JoinDesc.ANTI_JOIN:
+      //TODO : In case of anti join, bloom filter can be created on left side also ("IN (keylist right table)").
+      // But the filter should be "not-in" ("NOT IN (keylist right table)") as we want to select the records from
+      // left side which are not present in the right side. But it may cause wrong result as
+      // bloom filter may have false positive and thus simply adding not is not correct,
+      // special handling is required for "NOT IN".

Review comment:
   Makes sense. For this particular purpose, in the future we could use something like ``the opposite of a bloom filter`` to support such cases:
   https://github.com/jmhodges/opposite_of_a_bloom_filter/
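To spell out the caveat in the quoted TODO: a Bloom filter only over-approximates membership, so mightContain(key) can return true for a key that is not actually on the right side. Negating that answer to prune rows for NOT IN would then drop a row the anti join should keep. A tiny illustration with a deliberately undersized one-hash filter, written here only for the example and not taken from any real Bloom filter library:

{code:java}
import java.util.BitSet;

public class AntiJoinBloomFilterCaveat {

  // A deliberately tiny "Bloom filter": one hash function, four bits, so collisions are easy to show.
  static final class TinyBloom {
    private final BitSet bits = new BitSet(4);
    private static int bucket(long key) { return (int) Math.floorMod(key, 4L); }
    void add(long key) { bits.set(bucket(key)); }
    boolean mightContain(long key) { return bits.get(bucket(key)); }  // may report false positives
  }

  public static void main(String[] args) {
    TinyBloom rightSideFilter = new TinyBloom();
    rightSideFilter.add(8L);  // the only key actually present on the right side

    long leftKey = 4L;        // 4 is NOT on the right side, so the anti join must keep this row

    // Semi-join style pruning ("IN") is safe: a false positive only forwards an extra row,
    // and the real join still filters it out later.
    System.out.println("mightContain(4) = " + rightSideFilter.mightContain(leftKey));  // true (false positive)

    // Naive "NOT IN" pruning would drop the row whenever mightContain() is true,
    // which here discards a row the anti join should have returned.
    boolean naiveNotInKeepsRow = !rightSideFilter.mightContain(leftKey);
    System.out.println("naive NOT IN keeps the row? " + naiveNotInKeepsRow);  // false, i.e. wrong result
  }
}
{code}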





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461650)
Time Spent: 5h  (was: 4h 50m)



[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461651&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461651
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 21/Jul/20 15:53
Start Date: 21/Jul/20 15:53
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458205958



##
File path: 
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java
##
@@ -1792,6 +1794,8 @@ private void executeTest(MapJoinTestDescription testDesc, MapJoinTestData testDa
     case FULL_OUTER:
       executeTestFullOuter(testDesc, testData, title);
       break;
+    case ANTI: //TODO

Review comment:
   Shall we open a ticket to track this? What is the main challenge here?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 461651)
Time Spent: 5h 10m  (was: 5h)



[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461840&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461840
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 22/Jul/20 03:19
Start Date: 22/Jul/20 03:19
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r457831207



##
File path: parser/src/java/org/apache/hadoop/hive/ql/parse/FromClauseParser.g
##
@@ -145,6 +145,7 @@ joinToken
 | KW_RIGHT (KW_OUTER)? KW_JOIN -> TOK_RIGHTOUTERJOIN
 | KW_FULL  (KW_OUTER)? KW_JOIN -> TOK_FULLOUTERJOIN
 | KW_LEFT KW_SEMI KW_JOIN  -> TOK_LEFTSEMIJOIN
+| KW_ANTI KW_JOIN  -> TOK_ANTIJOIN

Review comment:
   Since we are exposing this and to prevent any ambiguity, should we use:
   
   `KW_LEFT KW_ANTI KW_JOIN -> TOK_LEFTANTISEMIJOIN`

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -509,11 +513,17 @@ protected void addToAliasFilterTags(byte alias, List object, boolean isN
     }
   }
 
+  private void createForwardJoinObjectForAntiJoin(boolean[] skip) throws HiveException {
+    boolean forward = fillFwdCache(skip);

Review comment:
   nit. Fwd -> Forward

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java
##
@@ -747,7 +747,7 @@ public static RewritablePKFKJoinInfo isRewritablePKFKJoin(Join join,
     final RelNode nonFkInput = leftInputPotentialFK ? join.getRight() : join.getLeft();
     final RewritablePKFKJoinInfo nonRewritable = RewritablePKFKJoinInfo.of(false, null);
 
-    if (joinType != JoinRelType.INNER && !join.isSemiJoin()) {
+    if (joinType != JoinRelType.INNER && !join.isSemiJoin() && joinType != JoinRelType.ANTI) {

Review comment:
   This is interesting. An antijoin of a PK-FK join returns no rows? Can we 
create a JIRA for such optimization based on integrity constraints?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java
##
@@ -100,7 +100,8 @@ public void onMatch(RelOptRuleCall call) {
     // These boolean values represent corresponding left, right input which is potential FK
     boolean leftInputPotentialFK = topRefs.intersects(leftBits);
     boolean rightInputPotentialFK = topRefs.intersects(rightBits);
-    if (leftInputPotentialFK && rightInputPotentialFK && (joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI)) {
+    if (leftInputPotentialFK && rightInputPotentialFK &&
+        (joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI || joinType == JoinRelType.ANTI)) {

Review comment:
   This is not correct and needs further thinking. If we have a PK-FK join 
that is only appending columns to the FK side, it basically means it is not 
filtering anything (everything is matching). If that is the case, then ANTIJOIN 
result would be empty? We could detect this at planning time and trigger the 
rewriting.
   
   Could we bail out from the rule if it is an ANTIJOIN and create a follow-up 
JIRA to tackle this and introduce further tests?
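
   A minimal sketch of the suggested bail-out shape (names assumed, not the rule's actual code):

{code:java}
import org.apache.calcite.rel.core.JoinRelType;

final class AntiJoinConstraintsBailOutSketch {
  // The constraints rule would simply not fire for ANTI until the follow-up
  // JIRA adds the reasoning discussed above (a non-filtering PK-FK anti join
  // would produce no rows and could be rewritten instead).
  static boolean applicable(JoinRelType joinType) {
    if (joinType == JoinRelType.ANTI) {
      return false;
    }
    return joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI;
  }
}
{code}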

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461842&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461842
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 22/Jul/20 03:25
Start Date: 22/Jul/20 03:25
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458511220



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new 
HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+final Project project = call.rel(0);
+final Filter filter = call.rel(1);
+final Join join = call.rel(2);
+perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, 
Join join) {
+LOG.debug("Matched HiveAntiJoinRule");
+
+assert (filter != null);
+
+//We support conversion from left outer join only.
+if (join.getJoinType() != JoinRelType.LEFT) {
+  return;
+}
+
+    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+boolean hasIsNull = false;
+
+// Get all filter condition and check if any of them is a "is null" kind.
+for (RexNode filterNode : aboveFilters) {
+  if (filterNode.getKind() == SqlKind.IS_NULL &&
+  isFilterFromRightSide(join, filterNode, join.getJoinType())) {
+hasIsNull = true;
+break;
+  }
+}
+
+// Is null should be on a key from right side of the join.
+if (!hasIsNull) {
+  return;
+}
+
+// Build anti join with same left, right child and condition as original 
left outer join.
+Join anti = join.copy(join.getTraitSet(), join.getCondition(),

Review comment:
   Probably it is here where we do not create the antijoin operator 
explicitly and why we end up with normal joins in Calcite plan. Since we are 
creating SemiJoin and AntiJoin as different operators, I think we should follow 
that pattern here and create an antijoin explicitly or using the builder (you 
can look at `HiveSemiJoinRule`). Nevertheless, we could possibly get rid of 
HiveAntiJoin and HiveSemiJoin all together as I mentioned in another comment, 
but that can be part of another JIRA.
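
   For illustration, a sketch of building the anti join through RelBuilder rather than copying the original left outer join node (assumed usage, not the patch code):

{code:java}
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.tools.RelBuilder;

final class BuildAntiJoinSketch {
  // Pushes both inputs and emits an explicit anti join with the original
  // join condition, instead of copy()-ing the left outer join.
  static RelNode buildAntiJoin(RelBuilder builder, RelNode left, RelNode right, RexNode condition) {
    return builder
        .push(left)
        .push(right)
        .antiJoin(condition)
        .build();
  }
}
{code}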




--

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=461841&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-461841
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 22/Jul/20 03:25
Start Date: 22/Jul/20 03:25
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r458511220



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new 
HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+final Project project = call.rel(0);
+final Filter filter = call.rel(1);
+final Join join = call.rel(2);
+perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, 
Join join) {
+LOG.debug("Matched HiveAntiJoinRule");
+
+assert (filter != null);
+
+//We support conversion from left outer join only.
+if (join.getJoinType() != JoinRelType.LEFT) {
+  return;
+}
+
+    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+boolean hasIsNull = false;
+
+// Get all filter condition and check if any of them is a "is null" kind.
+for (RexNode filterNode : aboveFilters) {
+  if (filterNode.getKind() == SqlKind.IS_NULL &&
+  isFilterFromRightSide(join, filterNode, join.getJoinType())) {
+hasIsNull = true;
+break;
+  }
+}
+
+// Is null should be on a key from right side of the join.
+if (!hasIsNull) {
+  return;
+}
+
+// Build anti join with same left, right child and condition as original 
left outer join.
+Join anti = join.copy(join.getTraitSet(), join.getCondition(),

Review comment:
   Probably it is here where we do not create the antijoin operator 
explicitly and why we end up with normal joins in Calcite plan. Since we are 
creating SemiJoin and AntiJoin as different operators, I think we should follow 
that pattern here and create an antijoin explicitly. Nevertheless, we could 
possibly get rid of HiveAntiJoin and HiveSemiJoin all together as I mentioned 
in another comment, but that can be part of another JIRA.





This is an automated message from the Apache Git Service.

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462362&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462362
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 03:39
Start Date: 23/Jul/20 03:39
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459198718



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) 
throws HiveException {
 forward = true;
   }
 }
+return forward;
+  }
+
+  // returns whether a record was forwarded
+  private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) 
throws HiveException {
+boolean forward = fillFwdCache(skip);
 if (forward) {
   if (needsPostEvaluation) {
 forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, 
residualJoinFiltersOIs);
   }
-  if (forward) {
+
+  // For anti join, check all right side and if nothing is matched then 
only forward.

Review comment:
   For anti join we don't emit the record here. It's done after all the records are checked and none of them matches the condition. Here, if forward is false we don't forward, and since it's an "&", we don't forward when antiJoin == true even if forward is true.
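
   To make the control flow concrete, a simplified, framework-free illustration (not the operator's actual code) of why nothing is forwarded inside the match loop for anti join:

{code:java}
import java.util.List;
import java.util.function.BiPredicate;

final class AntiJoinEmitSketch {
  // A left row is emitted only after every candidate right row has been
  // checked and none satisfied the (residual) join condition.
  static <L, R> boolean shouldEmit(L leftRow, List<R> rightRows, BiPredicate<L, R> matches) {
    for (R rightRow : rightRows) {
      if (matches.test(leftRow, rightRow)) {
        return false; // a single match disqualifies the left row
      }
    }
    return true;
  }
}
{code}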





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462362)
Time Spent: 5h 50m  (was: 5h 40m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462376&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462376
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 04:26
Start Date: 23/Jul/20 04:26
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459208067



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)

Review comment:
   The pre-batch processing is done only for joins which emit records from the right table. For semi join and anti join, it's not required.
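
   As a rough illustration of what the anti join map-join probe amounts to per batch -- a hash-set containment test with no right-side payload, which is also why no per-batch right-side setup is needed. This is a simplified sketch, not the vectorized operator's actual code:

{code:java}
import java.util.HashSet;
import java.util.Set;

final class AntiJoinProbeSketch {
  // Returns how many rows survive; surviving row indexes are written to selectedOut.
  static int probe(long[] keyColumn, int size, Set<Long> smallTableKeys, int[] selectedOut) {
    int selectedCount = 0;
    for (int i = 0; i < size; i++) {
      if (!smallTableKeys.contains(keyColumn[i])) {
        selectedOut[selectedCount++] = i; // no match on the small side: keep the row
      }
    }
    return selectedCount;
  }

  public static void main(String[] args) {
    Set<Long> rightKeys = new HashSet<>();
    rightKeys.add(10L);
    int[] selected = new int[3];
    int kept = probe(new long[] {10L, 20L, 30L}, 3, rightKeys, selected);
    System.out.println(kept); // 2
  }
}
{code}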




---

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462380&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462380
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 04:47
Start Date: 23/Jul/20 04:47
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459212369



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462384&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462384
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 05:01
Start Date: 23/Jul/20 05:01
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459215352



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz,
   @Override
   public void onMatch(RelOptRuleCall call) {
 Join join = call.rel(0);
-if (join.getJoinType() == JoinRelType.FULL || 
join.getCondition().isAlwaysTrue()) {
+
+// For anti join case add the not null on right side if the condition is

Review comment:
   For the case when we have a join condition which gets evaluated, it will return false while comparing with a null on the right side. But for an always-true join condition, we will not do a match for the right side, assuming it's always true. So for anti join, the left side records will not be emitted. To avoid this we put a null check on the right side table; for an all-null entry, no records will be projected from the right side, and thus all records from the left side will be emitted. So the comment is not very accurate. It's more like: even if the condition is always true, we add a null check on the right side for anti join. I will update it.
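
   For context, a rough sketch of constructing IS NOT NULL predicates over the right-side join keys, the kind of filter this rule is described as adding (illustrative helper, not the rule's code):

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.calcite.rex.RexBuilder;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;

final class NotNullOnRightKeysSketch {
  // One IS NOT NULL predicate per right-side join key reference, so rows whose
  // key is NULL do not wrongly suppress left-side output of the anti join.
  static List<RexNode> notNullFilters(RexBuilder rexBuilder, List<RexNode> rightKeyRefs) {
    List<RexNode> filters = new ArrayList<>();
    for (RexNode keyRef : rightKeyRefs) {
      filters.add(rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, keyRef));
    }
    return filters;
  }
}
{code}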





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462384)
Time Spent: 6h 20m  (was: 6h 10m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462387&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462387
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 05:08
Start Date: 23/Jul/20 05:08
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459216878



##
File path: ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java
##
@@ -339,6 +339,12 @@ String getFuncText(String funcText, final int srcPos) {
   vector.add(right, left);
   break;
 case JoinDesc.LEFT_OUTER_JOIN:
+case JoinDesc.ANTI_JOIN:
+//TODO : In case of anti join, bloom filter can be created on left 
side also ("IN (keylist right table)").
+// But the filter should be "not-in" ("NOT IN (keylist right table)") 
as we want to select the records from
+// left side which are not present in the right side. But it may cause 
wrong result as
+// bloom filter may have false positive and thus simply adding not is 
not correct,
+// special handling is required for "NOT IN".

Review comment:
   created a Jira ..https://issues.apache.org/jira/browse/HIVE-23903
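
   To illustrate the false-positive issue mentioned in the TODO (using Guava's BloomFilter purely for illustration; this is not code from the patch):

{code:java}
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

final class NotInBloomFilterSketch {
  public static void main(String[] args) {
    BloomFilter<Long> rightKeys = BloomFilter.create(Funnels.longFunnel(), 1000, 0.03);
    rightKeys.put(42L);

    long leftKey = 7L;
    // mightContain() == false proves the key is absent, so the left row is
    // guaranteed to survive the anti join. mightContain() == true may be a
    // false positive, so it must NOT be used to drop left rows for "NOT IN".
    if (!rightKeys.mightContain(leftKey)) {
      System.out.println("key " + leftKey + " certainly has no match on the right side");
    }
  }
}
{code}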





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462387)
Time Spent: 6.5h  (was: 6h 20m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462392&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462392
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 05:22
Start Date: 23/Jul/20 05:22
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459219966



##
File path: 
ql/src/test/org/apache/hadoop/hive/ql/exec/vector/mapjoin/TestMapJoinOperator.java
##
@@ -1792,6 +1794,8 @@ private void executeTest(MapJoinTestDescription testDesc, 
MapJoinTestData testDa
 case FULL_OUTER:
   executeTestFullOuter(testDesc, testData, title);
   break;
+case ANTI: //TODO

Review comment:
   https://issues.apache.org/jira/browse/HIVE-23904





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462392)
Time Spent: 6h 40m  (was: 6.5h)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462394&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462394
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 05:25
Start Date: 23/Jul/20 05:25
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459220734



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java
##
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to 
merge it with semi join.

Review comment:
   https://issues.apache.org/jira/browse/HIVE-23905
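
   As a sketch of the refactoring direction tracked there (illustrative only): semi join and anti join share the same hash-set probe and differ only in whether a match keeps or rejects the row, so a single helper with a flag could serve both:

{code:java}
import java.util.Set;

final class SemiAntiSharedProbeSketch {
  // matched == true keeps the row for semi join; matched == false keeps it for anti join.
  static int select(long[] keys, int size, Set<Long> rightKeys, int[] selectedOut, boolean isAnti) {
    int count = 0;
    for (int i = 0; i < size; i++) {
      boolean matched = rightKeys.contains(keys[i]);
      if (matched != isAnti) {
        selectedOut[count++] = i;
      }
    }
    return count;
  }
}
{code}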





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462394)
Time Spent: 6h 50m  (was: 6h 40m)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462417&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462417
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 07:09
Start Date: 23/Jul/20 07:09
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459253392



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java
##
@@ -747,7 +747,7 @@ public static RewritablePKFKJoinInfo 
isRewritablePKFKJoin(Join join,
 final RelNode nonFkInput = leftInputPotentialFK ? join.getRight() : 
join.getLeft();
 final RewritablePKFKJoinInfo nonRewritable = 
RewritablePKFKJoinInfo.of(false, null);
 
-if (joinType != JoinRelType.INNER && !join.isSemiJoin()) {
+if (joinType != JoinRelType.INNER && !join.isSemiJoin() && joinType != 
JoinRelType.ANTI) {

Review comment:
   https://issues.apache.org/jira/browse/HIVE-23906





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462417)
Time Spent: 7h  (was: 6h 50m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462446&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462446
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 08:57
Start Date: 23/Jul/20 08:57
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459307547



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) 
throws HiveException {
 forward = true;
   }
 }
+return forward;
+  }
+
+  // returns whether a record was forwarded
+  private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) 
throws HiveException {
+boolean forward = fillFwdCache(skip);
 if (forward) {
   if (needsPostEvaluation) {
 forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, 
residualJoinFiltersOIs);
   }
-  if (forward) {
+
+  // For anti join, check all right side and if nothing is matched then 
only forward.

Review comment:
   OK, makes sense now -- so maybe we should just mention that for anti-join we don't forward at this point.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462446)
Time Spent: 7h 10m  (was: 7h)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462448
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 08:58
Start Date: 23/Jul/20 08:58
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459307921



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  }
+

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462450&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462450
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 09:00
Start Date: 23/Jul/20 09:00
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459308986



##
File path: ql/src/java/org/apache/hadoop/hive/ql/ppd/SyntheticJoinPredicate.java
##
@@ -339,6 +339,12 @@ String getFuncText(String funcText, final int srcPos) {
   vector.add(right, left);
   break;
 case JoinDesc.LEFT_OUTER_JOIN:
+case JoinDesc.ANTI_JOIN:
+//TODO : In case of anti join, bloom filter can be created on left 
side also ("IN (keylist right table)").
+// But the filter should be "not-in" ("NOT IN (keylist right table)") 
as we want to select the records from
+// left side which are not present in the right side. But it may cause 
wrong result as
+// bloom filter may have false positive and thus simply adding not is 
not correct,
+// special handling is required for "NOT IN".

Review comment:
   Thanks Mahesh! Had this in the back of my head for a while -- this will be useful for a bunch of cases including anti-joins.






This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462450)
Time Spent: 7.5h  (was: 7h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exis

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462452&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462452
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 23/Jul/20 09:03
Start Date: 23/Jul/20 09:03
Worklog Time Spent: 10m 
  Work Description: pgaref commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459310372



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz,
   @Override
   public void onMatch(RelOptRuleCall call) {
 Join join = call.rel(0);
-if (join.getJoinType() == JoinRelType.FULL || join.getCondition().isAlwaysTrue()) {
+
+// For anti join case add the not null on right side if the condition is

Review comment:
   Thanks! Makes sense now





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462452)
Time Spent: 7h 40m  (was: 7.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462805&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462805
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 02:35
Start Date: 24/Jul/20 02:35
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459825977



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveAntiJoin.java
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.reloperators;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Sets;
+import org.apache.calcite.plan.RelOptCluster;
+import org.apache.calcite.plan.RelTraitSet;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexNode;
+import org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRulesRegistry;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class HiveAntiJoin extends Join implements HiveRelNode {
+
+  private final RexNode joinFilter;

Review comment:
   The joinFilter holds the residual filter, which is used during post processing. These are the join conditions that are not part of the join key. I think the condition in Join holds the full condition.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462805)
Time Spent: 7h 50m  (was: 7h 40m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462810&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462810
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 02:39
Start Date: 24/Jul/20 02:39
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459826636



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveAntiJoin.java
##
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.reloperators;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Sets;
+import org.apache.calcite.plan.RelOptCluster;
+import org.apache.calcite.plan.RelTraitSet;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexNode;
+import org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRulesRegistry;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class HiveAntiJoin extends Join implements HiveRelNode {

Review comment:
   https://issues.apache.org/jira/browse/HIVE-23919





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462810)
Time Spent: 8h  (was: 7h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converte

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462811&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462811
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 02:41
Start Date: 24/Jul/20 02:41
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459827002



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -74,7 +78,14 @@ public HiveJoinAddNotNullRule(Class clazz,
   @Override
   public void onMatch(RelOptRuleCall call) {
 Join join = call.rel(0);
-if (join.getJoinType() == JoinRelType.FULL || join.getCondition().isAlwaysTrue()) {
+
+// For anti join case add the not null on right side if the condition is always true.
+// This is done because during execution, anti join expect the right side to be empty
+// and if we dont put null check on right, for null only right side table and condition
+// always true, execution will produce 0 records.
+// eg  select * from left_tbl where (select 1 from all_null_right limit 1) is null
+if (join.getJoinType() == JoinRelType.FULL ||
+    (join.getJoinType() != JoinRelType.ANTI && join.getCondition().isAlwaysTrue())) {

Review comment:
   Yes, the comment is not worded properly. The intent is that we will add a not-null condition for anti join even if the condition is always true.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462811)
Time Spent: 8h 10m  (was: 8h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462815&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462815
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 03:52
Start Date: 24/Jul/20 03:52
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459840139



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java
##
@@ -100,7 +100,8 @@ public void onMatch(RelOptRuleCall call) {
 // These boolean values represent corresponding left, right input which is potential FK
 boolean leftInputPotentialFK = topRefs.intersects(leftBits);
 boolean rightInputPotentialFK = topRefs.intersects(rightBits);
-if (leftInputPotentialFK && rightInputPotentialFK && (joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI)) {
+if (leftInputPotentialFK && rightInputPotentialFK &&
+    (joinType == JoinRelType.INNER || joinType == JoinRelType.SEMI || joinType == JoinRelType.ANTI)) {

Review comment:
   https://issues.apache.org/jira/browse/HIVE-23920





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462815)
Time Spent: 8h 20m  (was: 8h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462816&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462816
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 03:57
Start Date: 24/Jul/20 03:57
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459841087



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinProjectTransposeRule.java
##
@@ -133,6 +135,10 @@ private HiveJoinProjectTransposeRuleBase(
 
 public void onMatch(RelOptRuleCall call) {
   //TODO: this can be removed once CALCITE-3824 is released
+  Join joinRel = call.rel(0);
+  if (joinRel.getJoinType() == JoinRelType.ANTI) {

Review comment:
   This was causing some issue with having clause. 
https://issues.apache.org/jira/browse/HIVE-23921





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462816)
Time Spent: 8.5h  (was: 8h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462903&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462903
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 10:55
Start Date: 24/Jul/20 10:55
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459984171



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //    HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);
+    perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
+    LOG.debug("Matched HiveAntiJoinRule");
+
+    if (join.getCondition().isAlwaysTrue()) {
+      return;
+    }
+
+    //We support conversion from left outer join only.
+    if (join.getJoinType() != JoinRelType.LEFT) {
+      return;
+    }
+
+    assert (filter != null);
+
+    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+    boolean hasIsNull = false;
+
+    // Get all filter condition and check if any of them is a "is null" kind.
+    for (RexNode filterNode : aboveFilters) {
+      if (filterNode.getKind() == SqlKind.IS_NULL &&
+          isFilterFromRightSide(join, filterNode, join.getJoinType())) {
+        hasIsNull = true;
+        break;
+      }
+    }
+
+    // Is null should be on a key from right side of the join.
+    if (!hasIsNull) {
+      return;
+    }
+
+    // Build anti join with same left, right child and condition as original left outer join.
+    Join anti = join.copy(join.getTraitSet(), join.getCondition(),
+        join.getLeft(), join.getRight(), JoinRelType.ANTI, false);
+
+    //TODO : Do we really need it
+    call.getPlanner().onCopy(join, anti);
+
+    RelNode newProject = getNewProjectNode(project, anti);
+    if (newProject != null) {
+      call.getPlanner().onCopy(project, newProject);
+      call.transformTo(newProject);
+    }
+  }
+
+  protected RelNode getNewProjectNode(Project oldProject, Join newJoin) {

Review comment:
   I didn't find any such utility method, so added this.
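
For illustration only, a hypothetical helper that sketches the kind of check such a method needs (this is not the actual getNewProjectNode implementation): after the left outer join is replaced by an anti join, only left-side columns come out of the join, so the rebuilt projection may only reference input fields whose index is below the left field count.

{code:java}
// Hypothetical sketch, not Hive code: validate that a projection only touches left-side fields.
public class AntiJoinProjectCheck {
  public static boolean usesOnlyLeftFields(int[] projectedInputRefs, int leftFieldCount) {
    for (int inputRef : projectedInputRefs) {
      if (inputRef >= leftFieldCount) {
        return false;  // references a right-side column; the anti join rewrite must be skipped
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Left input has 3 fields (indexes 0..2): projecting {0, 2} is fine, {0, 4} is not.
    System.out.println(usesOnlyLeftFields(new int[]{0, 2}, 3));  // true
    System.out.println(usesOnlyLeftFields(new int[]{0, 4}, 3));  // false
  }
}
{code}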

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462904&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462904
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:01
Start Date: 24/Jul/20 11:01
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459986329



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSubQueryRemoveRule.java
##
@@ -414,6 +416,13 @@ private RexNode rewriteInExists(RexSubQuery e, Set variablesSet,
   // null keys we do not need to generate count(*), count(c)
   if (e.getKind() == SqlKind.EXISTS) {
 logic = RelOptUtil.Logic.TRUE_FALSE;
+if (conf.getBoolVar(HiveConf.ConfVars.HIVE_CONVERT_ANTI_JOIN)) {
+  //TODO : As of now anti join is first converted to left outer join

Review comment:
   The conversion is still not done here. The code is present, but the actual conversion
does not happen and the logic is still TRUE_FALSE. For the code to take effect, the logic
should be changed to FALSE. I have not done that yet, as it was causing some plan changes
that I could not judge to be expected or not. Anyway, I have created a JIRA to track this:
   https://issues.apache.org/jira/browse/HIVE-23928





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462904)
Time Spent: 8h 50m  (was: 8h 40m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462906&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462906
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:06
Start Date: 24/Jul/20 11:06
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459988214



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java
##
@@ -79,6 +80,11 @@ public Double getDistinctRowCount(HiveSemiJoin rel, RelMetadataQuery mq, Immutab
 return super.getDistinctRowCount(rel, mq, groupKey, predicate);
   }
 
+  public Double getDistinctRowCount(HiveAntiJoin rel, RelMetadataQuery mq, ImmutableBitSet groupKey,
+      RexNode predicate) {
+    return super.getDistinctRowCount(rel, mq, groupKey, predicate);

Review comment:
   Calcite 1.21 does not support distinct row count calculation for anti join; it only special-cases semi joins:
 if (join.isSemiJoin()) {
   return getSemiJoinDistinctRowCount(join, mq, groupKey, predicate);
 } else {
   Builder leftMask = ImmutableBitSet.builder();
   I think these rules will not get triggered for anti join as of now, because I am not converting the not-exists directly to anti join. As of now all these rules are applied on the left outer join, and then we convert the left outer join to anti join.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462906)
Time Spent: 9h  (was: 8h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462907&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462907
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:12
Start Date: 24/Jul/20 11:12
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459990495



##
File path: 
ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out
##
@@ -0,0 +1,99 @@
+PREHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+d_date between '2001-4-01' and
+   (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin 
Parish',
+  'Daviess County'
+)
+and exists (select *
+from catalog_sales cs2
+where cs1.cs_order_number = cs2.cs_order_number
+  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+   from catalog_returns cr1
+   where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+PREHOOK: type: QUERY
+PREHOOK: Input: default@call_center
+PREHOOK: Input: default@catalog_returns
+PREHOOK: Input: default@catalog_sales
+PREHOOK: Input: default@customer_address
+PREHOOK: Input: default@date_dim
+PREHOOK: Output: hdfs://### HDFS PATH ###
+POSTHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+d_date between '2001-4-01' and
+   (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin 
Parish',
+  'Daviess County'
+)
+and exists (select *
+from catalog_sales cs2
+where cs1.cs_order_number = cs2.cs_order_number
+  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+   from catalog_returns cr1
+   where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@call_center
+POSTHOOK: Input: default@catalog_returns
+POSTHOOK: Input: default@catalog_sales
+POSTHOOK: Input: default@customer_address
+POSTHOOK: Input: default@date_dim
+POSTHOOK: Output: hdfs://### HDFS PATH ###
+CBO PLAN:
+HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], agg#2=[sum($6)])
+  HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], cost=[not available])

Review comment:
   I think it's not a problem. The field indexes are for different inputs, so even though the number is the same, the conditions are different. Even without anti join, the condition is the same:

 HiveFilter(condition=[IS NULL($13)])
   HiveJoin(condition=[=($4, $14)], joinType=[left], algorithm=[none], cost=[not available])
 HiveSemiJoin(condition=[AND(<>($3, $13), =($4, $14))], joinType=[semi])





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462907)
Time Spent: 9h 10m  (was: 9h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant c

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462910&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462910
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:33
Start Date: 24/Jul/20 11:33
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459998845



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //    HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);
+    perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
+    LOG.debug("Matched HiveAntiJoinRule");
+
+    assert (filter != null);
+
+    //We support conversion from left outer join only.
+    if (join.getJoinType() != JoinRelType.LEFT) {
+      return;
+    }
+
+    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+    boolean hasIsNull = false;
+
+    // Get all filter condition and check if any of them is a "is null" kind.
+    for (RexNode filterNode : aboveFilters) {
+      if (filterNode.getKind() == SqlKind.IS_NULL &&
+          isFilterFromRightSide(join, filterNode, join.getJoinType())) {
+        hasIsNull = true;
+        break;
+      }
+    }
+
+    // Is null should be on a key from right side of the join.
+    if (!hasIsNull) {
+      return;
+    }
+
+    // Build anti join with same left, right child and condition as original left outer join.
+    Join anti = join.copy(join.getTraitSet(), join.getCondition(),

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462910)
Time Spent: 9h 20m  (was: 9h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
>  

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462918&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462918
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:39
Start Date: 24/Jul/20 11:39
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460001236



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##
@@ -1901,6 +1905,11 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema relOptSchema, SchemaPlu
   calcitePreCboPlan = applyPreJoinOrderingTransforms(calciteGenPlan,
   mdProvider.getMetadataProvider(), executorProvider);
 
+  if (conf.getBoolVar(ConfVars.HIVE_CONVERT_ANTI_JOIN)) {

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462918)
Time Spent: 9h 40m  (was: 9.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 9h 40m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462911&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462911
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:33
Start Date: 24/Jul/20 11:33
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r459998986



##
File path: 
ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out
##
@@ -0,0 +1,99 @@
+PREHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+d_date between '2001-4-01' and
+   (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin 
Parish',
+  'Daviess County'
+)
+and exists (select *
+from catalog_sales cs2
+where cs1.cs_order_number = cs2.cs_order_number
+  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+   from catalog_returns cr1
+   where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+PREHOOK: type: QUERY
+PREHOOK: Input: default@call_center
+PREHOOK: Input: default@catalog_returns
+PREHOOK: Input: default@catalog_sales
+PREHOOK: Input: default@customer_address
+PREHOOK: Input: default@date_dim
+PREHOOK: Output: hdfs://### HDFS PATH ###
+POSTHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+d_date between '2001-4-01' and
+   (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin 
Parish',
+  'Daviess County'
+)
+and exists (select *
+from catalog_sales cs2
+where cs1.cs_order_number = cs2.cs_order_number
+  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+   from catalog_returns cr1
+   where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@call_center
+POSTHOOK: Input: default@catalog_returns
+POSTHOOK: Input: default@catalog_sales
+POSTHOOK: Input: default@customer_address
+POSTHOOK: Input: default@date_dim
+POSTHOOK: Output: hdfs://### HDFS PATH ###
+CBO PLAN:
+HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], agg#2=[sum($6)])
+  HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], cost=[not available])
+HiveSemiJoin(condition=[AND(<>($3, $13), =($4, $14))], joinType=[semi])

Review comment:
   done ..creating the HiveAntiJoin operator directly





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462911)
Time Spent: 9.5h  (was: 9h 20m)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462919&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462919
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:41
Start Date: 24/Jul/20 11:41
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460001925



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
##
@@ -2606,6 +2607,17 @@ private long computeFinalRowCount(List<Long> rowCountParents, long interimRowCou
       // max # of rows = rows from left side
       result = Math.min(rowCountParents.get(joinCond.getLeft()), result);
       break;
+    case JoinDesc.ANTI_JOIN:
+      long leftRowCount = rowCountParents.get(joinCond.getLeft());
+      if (leftRowCount < result) {
+        // Ideally the inner join count should be less than the left row count, but if it is not
+        // calculated properly then we can assume the whole of the left table will be selected.
+        result = leftRowCount;

Review comment:
   This case arises only when the stats are not accurate. To be on the safer side, I assume that all rows from the left side will be projected, which is the maximum possible value. If it were set to 0 instead, a rewrite could be triggered on the assumption that the join result is empty, which we want to avoid.
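
To make that reasoning concrete, a small sketch of the estimate being discussed; the method and variable names are illustrative, and the subtraction step is an assumption about the surrounding code that this hunk does not show:

{code:java}
/**
 * Illustrative sketch (not the actual StatsRulesProcFactory code) of the
 * anti join row-count estimate: keep the left rows that are expected to
 * find no match, and fall back to the whole left side when the interim
 * inner-join estimate is larger than the left input itself.
 */
final class AntiJoinRowCountSketch {
  static long estimate(long leftRowCount, long interimInnerJoinRowCount) {
    if (leftRowCount < interimInnerJoinRowCount) {
      // Inconsistent stats: conservatively assume every left row survives.
      return leftRowCount;
    }
    // Left rows minus those expected to find a match on the right side.
    return leftRowCount - interimInnerJoinRowCount;
  }
}
{code}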





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462919)
Time Spent: 9h 50m  (was: 9h 40m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=462920&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-462920
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 24/Jul/20 11:42
Start Date: 24/Jul/20 11:42
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460002261



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java
##
@@ -142,7 +146,7 @@ private Double computeInnerJoinSelectivity(Join j, RelMetadataQuery mq, RexNode
         ndvEstimate = exponentialBackoff(peLst, colStatMap);
       }
 
-      if (j.isSemiJoin()) {
+      if (j.isSemiJoin() || (j instanceof HiveJoin && j.getJoinType().equals(JoinRelType.ANTI))) {

Review comment:
   done
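
As a quick restatement of the guard in that hunk (illustrative only, and omitting the patch's additional HiveJoin instanceof check): semi joins and anti joins share the NDV-based selectivity path instead of the generic join path.

{code:java}
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rel.core.JoinRelType;

// Restates the condition from the hunk above; the patch additionally requires
// the node to be a HiveJoin before treating an ANTI join like a semi join.
final class SelectivityDispatchSketch {
  static boolean usesSemiJoinStyleSelectivity(Join j) {
    return j.isSemiJoin() || j.getJoinType() == JoinRelType.ANTI;
  }
}
{code}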





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 462920)
Time Spent: 10h  (was: 9h 50m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=46&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-46
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:29
Start Date: 26/Jul/20 11:29
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460515257



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveRemoveGBYSemiJoinRule.java
##
@@ -41,17 +41,19 @@
 
   public HiveRemoveGBYSemiJoinRule() {
     super(
-        operand(HiveSemiJoin.class,
+        operand(Join.class,
             some(
                 operand(RelNode.class, any()),
                 operand(Aggregate.class, any()))),
         HiveRelFactories.HIVE_BUILDER, "HiveRemoveGBYSemiJoinRule");
   }
 
   @Override public void onMatch(RelOptRuleCall call) {
-    final HiveSemiJoin semijoin = call.rel(0);
+    final Join join = call.rel(0);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 46)
Time Spent: 10h 10m  (was: 10h)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463334&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463334
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:34
Start Date: 26/Jul/20 11:34
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460515695



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdRowCount.java
##
@@ -118,6 +119,15 @@ public Double getRowCount(HiveJoin join, RelMetadataQuery mq) {
   }
 
   public Double getRowCount(HiveSemiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  public Double getRowCount(HiveAntiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  private Double getRowCountInt(Join rel, RelMetadataQuery mq) {

Review comment:
   Yes done.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463334)
Time Spent: 10h 20m  (was: 10h 10m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463335&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463335
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:59
Start Date: 26/Jul/20 11:59
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518411



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdRowCount.java
##
@@ -118,6 +119,15 @@ public Double getRowCount(HiveJoin join, RelMetadataQuery mq) {
   }
 
   public Double getRowCount(HiveSemiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  public Double getRowCount(HiveAntiJoin rel, RelMetadataQuery mq) {
+    return getRowCountInt(rel, mq);
+  }
+
+  private Double getRowCountInt(Join rel, RelMetadataQuery mq) {

Review comment:
   super.getRowCount(rel, mq) does not support Anti join. I think we need 
to handle it.
   https://issues.apache.org/jira/browse/HIVE-23933





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463335)
Time Spent: 10.5h  (was: 10h 20m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463336&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463336
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 11:59
Start Date: 26/Jul/20 11:59
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518454



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java
##
@@ -79,6 +80,11 @@ public Double getDistinctRowCount(HiveSemiJoin rel, RelMetadataQuery mq, Immutab
     return super.getDistinctRowCount(rel, mq, groupKey, predicate);
   }
 
+  public Double getDistinctRowCount(HiveAntiJoin rel, RelMetadataQuery mq, ImmutableBitSet groupKey,
+      RexNode predicate) {
+    return super.getDistinctRowCount(rel, mq, groupKey, predicate);

Review comment:
   https://issues.apache.org/jira/browse/HIVE-23933





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463336)
Time Spent: 10h 40m  (was: 10.5h)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463337&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463337
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:03
Start Date: 26/Jul/20 12:03
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518799



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {
+  protected static final Logger LOG = LoggerFactory.getLogger(HiveJoinWithFilterToAntiJoinRule.class);
+  public static final HiveJoinWithFilterToAntiJoinRule INSTANCE = new HiveJoinWithFilterToAntiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //    HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveJoinWithFilterToAntiJoinRule() {
+    super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+        "HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // IS NULL filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);
+    perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, Join join) {
+    LOG.debug("Matched HiveAntiJoinRule");
+
+    if (join.getCondition().isAlwaysTrue()) {
+      return;
+    }
+
+    // We support conversion from left outer join only.
+    if (join.getJoinType() != JoinRelType.LEFT) {
+      return;
+    }
+
+    assert (filter != null);
+
+    List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+    boolean hasIsNull = false;
+
+    // Get all filter conditions and check if any of them is of the "is null" kind.
+    for (RexNode filterNode : aboveFilters) {
+      if (filterNode.getKind() == SqlKind.IS_NULL &&
+          isFilterFromRightSide(join, filterNode, join.getJoinType())) {
+        hasIsNull = true;
+        break;
+      }
+    }
+
+    // The IS NULL must be on a key from the right side of the join.
+    if (!hasIsNull) {
+      return;
+    }
+
+    // Build an anti join with the same left and right children and the same condition
+    // as the original left outer join.
+    Join anti = join.copy(join.getTraitSet(), join.getCondition(),
+        join.getLeft(), join.getRight(), JoinRelType.ANTI, false);
+
+    //TODO : Do we really need it
+    call.getPlanner().onCopy(join, anti);
+
+    RelNode newProject = getNewProjectNode(project, anti);
+    if (newProject != null) {
+      call.getPlanner().onCopy(project, newProject);

Review comment:
   done
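
The isFilterFromRightSide(...) helper referenced by the rule is not part of this hunk. As a rough illustration of what such a check has to establish, the sketch below (stock Calcite utilities, illustrative names, not the actual helper) verifies that the IS NULL predicate references only columns coming from the join's right input:

{code:java}
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.util.ImmutableBitSet;

// Illustrative stand-in for the rule's isFilterFromRightSide(...) check: the
// IS NULL filter may only reference fields of the right input, otherwise the
// left join cannot be rewritten to an anti join.
final class RightSideFilterCheckSketch {
  static boolean referencesOnlyRightInput(Join join, RexNode filterNode) {
    int leftFieldCount = join.getLeft().getRowType().getFieldCount();
    ImmutableBitSet usedFields = RelOptUtil.InputFinder.bits(filterNode);
    for (int fieldIndex : usedFields) {
      if (fieldIndex < leftFieldCount) {
        return false;  // the predicate touches a left-side column
      }
    }
    return !usedFields.isEmpty();  // at least one right-side column referenced
  }
}
{code}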

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463338&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463338
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:04
Start Date: 26/Jul/20 12:04
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460518974



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinWithFilterToAntiJoinRule.java
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataTypeField;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.ImmutableBitSet;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveJoinWithFilterToAntiJoinRule extends RelOptRule {

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463338)
Time Spent: 11h  (was: 10h 50m)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463339
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:05
Start Date: 26/Jul/20 12:05
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460519111



##
File path: ql/src/java/org/apache/hadoop/hive/ql/plan/VectorMapJoinDesc.java
##
@@ -89,7 +89,8 @@ public PrimitiveTypeInfo getPrimitiveTypeInfo() {
 INNER_BIG_ONLY,
 LEFT_SEMI,
 OUTER,
-FULL_OUTER
+FULL_OUTER,
+ANTI

Review comment:
   LEFT_ANTI





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463339)
Time Spent: 11h 10m  (was: 11h)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463340&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463340
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:10
Start Date: 26/Jul/20 12:10
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460519523



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -56,6 +57,9 @@
   public static final HiveJoinAddNotNullRule INSTANCE_SEMIJOIN =
       new HiveJoinAddNotNullRule(HiveSemiJoin.class, HiveRelFactories.HIVE_FILTER_FACTORY);
 
+  public static final HiveJoinAddNotNullRule INSTANCE_ANTIJOIN =
+      new HiveJoinAddNotNullRule(HiveAntiJoin.class, HiveRelFactories.HIVE_FILTER_FACTORY);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463340)
Time Spent: 11h 20m  (was: 11h 10m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463341&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463341
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:20
Start Date: 26/Jul/20 12:20
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460520647



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveSubQRemoveRelBuilder.java
##
@@ -1112,7 +1112,7 @@ public RexNode field(RexNode e, String name) {
   }
 
   public HiveSubQRemoveRelBuilder join(JoinRelType joinType, RexNode condition,
-      Set<CorrelationId> variablesSet, boolean createSemiJoin) {
+      Set<CorrelationId> variablesSet, JoinRelType semiJoinType) {

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463341)
Time Spent: 11.5h  (was: 11h 20m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463342&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463342
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:23
Start Date: 26/Jul/20 12:23
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460520903



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptMaterializationValidator.java
##
@@ -253,6 +256,14 @@ private RelNode visit(HiveSemiJoin semiJoin) {
 return visitChildren(semiJoin);
   }
 
+  // Note: Not currently part of the HiveRelNode interface
+  private RelNode visit(HiveAntiJoin antiJoin) {

Review comment:
   Not sure; this was copy-pasted from the semi join handling.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463342)
Time Spent: 11h 40m  (was: 11.5h)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463343&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463343
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:23
Start Date: 26/Jul/20 12:23
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460520930



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelFactories.java
##
@@ -188,6 +193,20 @@ public RelNode createSemiJoin(RelNode left, RelNode right,
 }
   }
 
+  /**
+   * Implementation of {@link AntiJoinFactory} that returns
+   * {@link org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin}.
+   */
+  private static class HiveAntiJoinFactoryImpl implements SemiJoinFactory {

Review comment:
   HiveAntiJoinFactoryImpl is removed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463343)
Time Spent: 11h 50m  (was: 11h 40m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463344&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463344
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:24
Start Date: 26/Jul/20 12:24
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460521056



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -153,6 +153,8 @@
 
   transient boolean hasLeftSemiJoin = false;
 
+  transient boolean hasAntiJoin = false;

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463344)
Time Spent: 12h  (was: 11h 50m)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463345&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463345
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:26
Start Date: 26/Jul/20 12:26
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460521246



##
File path: parser/src/java/org/apache/hadoop/hive/ql/parse/FromClauseParser.g
##
@@ -145,6 +145,7 @@ joinToken
 | KW_RIGHT (KW_OUTER)? KW_JOIN -> TOK_RIGHTOUTERJOIN
 | KW_FULL  (KW_OUTER)? KW_JOIN -> TOK_FULLOUTERJOIN
 | KW_LEFT KW_SEMI KW_JOIN  -> TOK_LEFTSEMIJOIN
+| KW_ANTI KW_JOIN  -> TOK_ANTIJOIN

Review comment:
   done
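
For orientation, the surface syntax the new TOK_ANTIJOIN token corresponds to, shown only as a plain string because this hunk alone does not confirm whether users are expected to write ANTI JOIN directly or whether it is only produced by the optimizer rewrite; the tables reuse the example from the issue description:

{code:java}
// Illustrative only: HiveQL shape matching the KW_ANTI KW_JOIN grammar rule above.
final class AntiJoinSyntaxExample {
  static final String QUERY =
      "SELECT wr_order_number "
      + "FROM web_returns ANTI JOIN web_sales "
      + "ON wr_order_number = ws_order_number";
}
{code}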





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463345)
Time Spent: 12h 10m  (was: 12h)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463346&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463346
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:27
Start Date: 26/Jul/20 12:27
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460521384



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -509,11 +513,17 @@ protected void addToAliasFilterTags(byte alias, 
List object, boolean isN
 }
   }
 
+  private void createForwardJoinObjectForAntiJoin(boolean[] skip) throws 
HiveException {
+boolean forward = fillFwdCache(skip);

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463346)
Time Spent: 12h 20m  (was: 12h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs an anti join is
> converted to a left outer join, and a null filter on the right-side join key is
> added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from the
> right side, and extra filtering is needed to remove the redundant rows. This
> can be avoided with an anti join, which projects only the required columns and
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records can be dropped at the
> child node instead of being moved to the join node. This can reduce data
> movement significantly when the number of distinct rows (join keys) is much
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient,
> since only the key is needed to check whether a record matches the join
> condition. For a left join, both the key and the non-key columns are needed,
> so a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in web_sales in a typical 10TB
> TPC-DS setup is just 10% of the total records. So when this query is converted
> to an anti join, only 600 million rows are moved to the join node instead of
> 7 billion.
> In the current patch, just one conversion is done: the pattern
> project->filter->left-join is converted to project->anti-join. This takes care
> of subqueries with a "not exists" clause, which are first converted to
> filter + left-join and then to an anti join. Queries with "not in" are not
> handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463349&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463349
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:36
Start Date: 26/Jul/20 12:36
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522312



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463348&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463348
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:36
Start Date: 26/Jul/20 12:36
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522261



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinStringOperator.java
##
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.StringExpr;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// Single-Column String hash table import.
+// Single-Column String specific imports.
+
+// TODO : Duplicate codes need to merge with semi join.
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column String
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinStringOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+
+  
//
+
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinStringOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  
//
+
+  // (none)
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+  //---
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinBytesHashSet hashSet;
+
+  //---
+  // Single-Column String specific members.
+  //
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  //---
+  // Pass-thru constructors.
+  //
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinStringOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinStringOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+ VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  //---
+  // Process Single-Column String anti Join on a vectorized row batch.
+  //
+
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+/*
+ * Initialize Single-Column String members for this specialized class.
+ */
+
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+/*
+ * Get our Single-Column String hash set information for 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463350
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:37
Start Date: 26/Jul/20 12:37
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522454



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463351&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463351
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 26/Jul/20 12:38
Start Date: 26/Jul/20 12:38
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460522562



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463442
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:39
Start Date: 27/Jul/20 01:39
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605498



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -638,6 +657,12 @@ private void genObject(int aliasNum, boolean allLeftFirst, 
boolean allLeftNull)
   // skipping the rest of the rows in the rhs table of the semijoin
   done = !needsPostEvaluation;
 }
+  } else if (type == JoinDesc.ANTI_JOIN) {
+if (innerJoin(skip, left, right)) {
+  // if anti join found a match then the condition is not matched for 
anti join, so we can skip rest of the

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463442)
Time Spent: 13h 10m  (was: 13h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 13h 10m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs an anti join is
> converted to a left outer join, and a null filter on the right-side join key is
> added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from the
> right side, and extra filtering is needed to remove the redundant rows. This
> can be avoided with an anti join, which projects only the required columns and
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records can be dropped at the
> child node instead of being moved to the join node. This can reduce data
> movement significantly when the number of distinct rows (join keys) is much
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient,
> since only the key is needed to check whether a record matches the join
> condition. For a left join, both the key and the non-key columns are needed,
> so a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in web_sales in a typical 10TB
> TPC-DS setup is just 10% of the total records. So when this query is converted
> to an anti join, only 600 million rows are moved to the join node instead of
> 7 billion.
> In the current patch, just one conversion is done: the pattern
> project->filter->left-join is converted to project->anti-join. This takes care
> of subqueries with a "not exists" clause, which are first converted to
> filter + left-join and then to an anti join. Queries with "not in" are not
> handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463443&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463443
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:40
Start Date: 27/Jul/20 01:40
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605730



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
##
@@ -523,11 +533,19 @@ private boolean createForwardJoinObject(boolean[] skip) 
throws HiveException {
 forward = true;
   }
 }
+return forward;
+  }
+
+  // returns whether a record was forwarded
+  private boolean createForwardJoinObject(boolean[] skip, boolean antiJoin) 
throws HiveException {
+boolean forward = fillFwdCache(skip);
 if (forward) {
   if (needsPostEvaluation) {
 forward = !JoinUtil.isFiltered(forwardCache, residualJoinFilters, 
residualJoinFiltersOIs);
   }
-  if (forward) {
+
+  // For anti join, check all right side and if nothing is matched then 
only forward.

Review comment:
   done
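
To make the forwarding rule above concrete, here is a small hedged HiveQL
illustration (the toy table names t_left and t_right are assumptions): a left
row is forwarded only when none of the right rows match its key, which is the
NOT EXISTS semantics.

{code:sql}
-- Rows of t_left forwarded by an anti join = rows with no matching key in t_right.
SELECT l.id
FROM t_left l ANTI JOIN t_right r ON l.id = r.id;

-- Same result expressed with NOT EXISTS:
SELECT l.id
FROM t_left l
WHERE NOT EXISTS (SELECT 1 FROM t_right r WHERE r.id = l.id);
{code}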





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463443)
Time Spent: 13h 20m  (was: 13h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 13h 20m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs an anti join is
> converted to a left outer join, and a null filter on the right-side join key is
> added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from the
> right side, and extra filtering is needed to remove the redundant rows. This
> can be avoided with an anti join, which projects only the required columns and
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records can be dropped at the
> child node instead of being moved to the join node. This can reduce data
> movement significantly when the number of distinct rows (join keys) is much
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient,
> since only the key is needed to check whether a record matches the join
> condition. For a left join, both the key and the non-key columns are needed,
> so a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in web_sales in a typical 10TB
> TPC-DS setup is just 10% of the total records. So when this query is converted
> to an anti join, only 600 million rows are moved to the join node instead of
> 7 billion.
> In the current patch, just one conversion is done: the pattern
> project->filter->left-join is converted to project->anti-join. This takes care
> of subqueries with a "not exists" clause, which are first converted to
> filter + left-join and then to an anti join. Queries with "not in" are not
> handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463444&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463444
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:40
Start Date: 27/Jul/20 01:40
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460605836



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java
##
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to 
merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti 
joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key does
+ * not exist in the small table.
+ *
+ * No small table values are needed for anti since they would be empty.  So,
+ * we use a hash set as the hash table.  Hash sets just report whether a key 
exists.  This
+ * is a big performance optimization.
+ */
+public abstract class VectorMapJoinAntiJoinGenerateResultOperator
+extends VectorMapJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = 
LoggerFactory.getLogger(VectorMapJoinAntiJoinGenerateResultOperator.class.getName());
+
+  // Anti join specific members.
+
+  // An array of hash set results so we can do lookups on the whole batch 
before output result
+  // generation.
+  protected transient VectorMapJoinHashSetResult hashSetResults[];
+
+  // Pre-allocated member for storing the (physical) batch index of matching 
row (single- or
+  // multi-small-table-valued) indexes during a process call.
+  protected transient int[] allMatchs;
+
+  // Pre-allocated member for storing the (physical) batch index of rows that 
need to be spilled.
+  protected transient int[] spills;
+
+  // Pre-allocated member for storing index into the hashSetResults for each 
spilled row.
+  protected transient int[] spillHashMapResultIndices;
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinGenerateResultOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx) 
{
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinGenerateResultOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+ VectorizationContext 
vContext, VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  /*
+   * Setup our anti join specific members.
+   */
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Anti join specific.
+VectorMapJoinHashSet baseHashSet = (VectorMapJoinHashSet) 
vectorMapJoinHashTable;
+
+hashSetResults = new 
VectorMapJoinHashSetResult[VectorizedRowBatch.DEFAULT_SIZE];
+for (int i = 0; i < hashS

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=463446&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-463446
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 27/Jul/20 01:41
Start Date: 27/Jul/20 01:41
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r460606016



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinGenerateResultOperator.java
##
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSet;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashSetResult;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinHashTableResult;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+
+// TODO : This class is duplicate of semi join. Need to do a refactoring to 
merge it with semi join.
+/**
+ * This class has methods for generating vectorized join results for Anti 
joins.
+ * The big difference between inner joins and anti joins is existence testing.
+ * Inner joins use a hash map to lookup the 1 or more small table values.
+ * Anti joins are a specialized join for outputting big table rows whose key 
exists

Review comment:
   done
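
A hedged way to check whether the specialized vectorized anti-join operators in
this PR are actually picked for a query (the two SET properties below are
standard Hive flags; the exact EXPLAIN output format is not taken from the
patch and may differ):

{code:sql}
-- Enable map join and vectorized execution, then inspect the plan; the
-- vectorized anti-join map-join operator should show up in the
-- vectorization details when the rewrite kicks in.
SET hive.auto.convert.join=true;
SET hive.vectorized.execution.enabled=true;
EXPLAIN VECTORIZATION DETAIL
SELECT wr_order_number
FROM web_returns
WHERE NOT EXISTS (SELECT 1 FROM web_sales
                  WHERE ws_order_number = wr_order_number);
{code}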





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 463446)
Time Spent: 13h 40m  (was: 13.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 13h 40m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs an anti join is
> converted to a left outer join, and a null filter on the right-side join key is
> added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from the
> right side, and extra filtering is needed to remove the redundant rows. This
> can be avoided with an anti join, which projects only the required columns and
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records can be dropped at the
> child node instead of being moved to the join node. This can reduce data
> movement significantly when the number of distinct rows (join keys) is much
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient,
> since only the key is needed to check whether a record matches the join
> condition. For a left join, both the key and the non-key columns are needed,
> so a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=465942&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465942
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:02
Start Date: 03/Aug/20 23:02
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r464673502



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java
##
@@ -747,6 +747,8 @@ public static RewritablePKFKJoinInfo 
isRewritablePKFKJoin(Join join,
 final RelNode nonFkInput = leftInputPotentialFK ? join.getRight() : 
join.getLeft();
 final RewritablePKFKJoinInfo nonRewritable = 
RewritablePKFKJoinInfo.of(false, null);
 
+// TODO : Need to handle Anti join.

Review comment:
   Thanks for creating HIVE-23906. Can we simply return `nonRewritable` if 
it is an anti-join for the time being, rather than proceeding? This certainly 
requires a bit of extra thinking and specific tests to make sure it is working 
as expected (for which we already have HIVE-23906).

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java
##
@@ -183,6 +189,7 @@ public void onMatch(RelOptRuleCall call) {
 switch (joinType) {
 case SEMI:
 case INNER:
+case ANTI:

Review comment:
   This should be removed to avoid confusion, since we bail out above.

##
File path: ql/src/test/queries/clientpositive/subquery_in_having.q
##
@@ -140,6 +140,22 @@ CREATE TABLE src_null_n4 (key STRING COMMENT 'default', 
value STRING COMMENT 'de
 LOAD DATA LOCAL INPATH "../../data/files/kv1.txt" INTO TABLE src_null_n4;
 INSERT INTO src_null_n4 values('5444', null);
 
+explain
+select key, value, count(*)

Review comment:
   Should we execute this query with conversion=true?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java
##
@@ -1233,4 +1233,21 @@ public FixNullabilityShuttle(RexBuilder rexBuilder,
 }
   }
 
+  // Checks if any of the expression given as list expressions are from right 
side of the join.

Review comment:
   nit. Change comment to javadoc

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveAntiSemiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveAntiSemiJoinRule.class);
+  public static final HiveAntiSemiJoinRule INSTANCE = new 
HiveAntiSemiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveAntiSemiJoinRule() {
+super(operand(Project.class, operand(Filter.class, operand(Join.class, 
RelOptRule.any(,
+"HiveJoinWithFilterToAntiJoinRule:filter");

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=465943&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-465943
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 03/Aug/20 23:03
Start Date: 03/Aug/20 23:03
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on pull request #1147:
URL: https://github.com/apache/hive/pull/1147#issuecomment-668281996


   @maheshk114, thanks for addressing the first batch of comments. The PR looks 
better. I have done a second pass and left some additional comments that should 
be addressed before merging. Please also merge master into your branch, since 
there seem to be some conflicts.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 465943)
Time Spent: 14h  (was: 13h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 14h
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that needs an anti join is
> converted to a left outer join, and a null filter on the right-side join key is
> added to get the desired result. This causes:
>  # Extra computation - The left outer join projects redundant columns from the
> right side, and extra filtering is needed to remove the redundant rows. This
> can be avoided with an anti join, which projects only the required columns and
> rows from the left-side table.
>  # Extra shuffle - With an anti join, duplicate records can be dropped at the
> child node instead of being moved to the join node. This can reduce data
> movement significantly when the number of distinct rows (join keys) is much
> smaller than the total row count.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient,
> since only the key is needed to check whether a record matches the join
> condition. For a left join, both the key and the non-key columns are needed,
> so a hash table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales ON
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> the number of distinct ws_order_number values in web_sales in a typical 10TB
> TPC-DS setup is just 10% of the total records. So when this query is converted
> to an anti join, only 600 million rows are moved to the join node instead of
> 7 billion.
> In the current patch, just one conversion is done: the pattern
> project->filter->left-join is converted to project->anti-join. This takes care
> of subqueries with a "not exists" clause, which are first converted to
> filter + left-join and then to an anti join. Queries with "not in" are not
> handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=466425&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466425
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 04/Aug/20 20:02
Start Date: 04/Aug/20 20:02
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r465297563



##
File path: ql/src/test/queries/clientpositive/subquery_in_having.q
##
@@ -140,6 +140,22 @@ CREATE TABLE src_null_n4 (key STRING COMMENT 'default', 
value STRING COMMENT 'de
 LOAD DATA LOCAL INPATH "../../data/files/kv1.txt" INTO TABLE src_null_n4;
 INSERT INTO src_null_n4 values('5444', null);
 
+explain
+select key, value, count(*)

Review comment:
   By default, anti join conversion is set to true. I have added a few test 
cases with anti join conversion set to false.
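
A hedged sketch of how such a test toggles the conversion (the property name
below is an assumption based on this discussion, not taken from the patch):

{code:sql}
-- Assumed property controlling the left-join + IS NULL -> anti-join rewrite.
SET hive.auto.convert.anti.join=false;
EXPLAIN
SELECT wr_order_number
FROM web_returns LEFT JOIN web_sales
  ON wr_order_number = ws_order_number
WHERE ws_order_number IS NULL;

SET hive.auto.convert.anti.join=true;
{code}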

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/ppd/PredicateTransitivePropagate.java
##
@@ -203,6 +203,7 @@ private boolean filterExists(ReduceSinkOperator target, 
ExprNodeDesc replaced) {
   vector.add(right, left);
   break;
 case JoinDesc.LEFT_OUTER_JOIN:
+case JoinDesc.ANTI_JOIN: //TODO : need to test

Review comment:
   removed the comment.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java
##
@@ -183,6 +189,7 @@ public void onMatch(RelOptRuleCall call) {
 switch (joinType) {
 case SEMI:
 case INNER:
+case ANTI:

Review comment:
   done

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -92,7 +104,7 @@ public void onMatch(RelOptRuleCall call) {
 Set rightPushedPredicates = 
Sets.newHashSet(registry.getPushedPredicates(join, 1));
 
 boolean genPredOnLeft = join.getJoinType() == JoinRelType.RIGHT || 
join.getJoinType() == JoinRelType.INNER || join.isSemiJoin();
-boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || 
join.getJoinType() == JoinRelType.INNER || join.isSemiJoin();
+boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || 
join.getJoinType() == JoinRelType.INNER || join.isSemiJoin()|| 
join.getJoinType() == JoinRelType.ANTI;

Review comment:
   Yes. If the right side is null, then it emits all the right side records.
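
To illustrate why generating a not-null predicate on the right input is safe for
anti join (a sketch reusing the TPC-DS tables from the issue description):
right-side rows with a NULL join key can never satisfy the equality condition,
so removing them cannot change which left rows end up unmatched.

{code:sql}
-- Filtering NULL join keys out of the right input leaves the anti-join
-- result unchanged.
SELECT wr_order_number
FROM web_returns ANTI JOIN
     (SELECT ws_order_number FROM web_sales WHERE ws_order_number IS NOT NULL) s
  ON wr_order_number = s.ws_order_number;
{code}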

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveAntiSemiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveAntiSemiJoinRule.class);
+  public static final HiveAntiSemiJoinRule INSTANCE = new 
HiveAntiSemiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveAntiSemiJoinRule() {
+super(operand(Pro

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=466569&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466569
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 05/Aug/20 03:12
Start Date: 05/Aug/20 03:12
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r465446298



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveAntiSemiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveAntiSemiJoinRule.class);
+  public static final HiveAntiSemiJoinRule INSTANCE = new 
HiveAntiSemiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveAntiSemiJoinRule() {
+super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+"HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+final Project project = call.rel(0);
+final Filter filter = call.rel(1);
+final Join join = call.rel(2);
+perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, 
Join join) {
+LOG.debug("Start Matching HiveAntiJoinRule");
+
+//TODO : Need to support this scenario.
+if (join.getCondition().isAlwaysTrue()) {
+  return;
+}
+
+//We support conversion from left outer join only.
+if (join.getJoinType() != JoinRelType.LEFT) {
+  return;
+}
+
+assert (filter != null);
+
+// If null filter is not present from right side then we can not convert 
to anti join.
+List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+Stream<RexNode> nullFilters = aboveFilters.stream().filter(filterNode -> filterNode.getKind() == SqlKind.IS_NULL);
+boolean hasNullFilter = HiveCalciteUtil.hasAnyExpressionFromRightSide(join, nullFilters.collect(Collectors.toList()));
+if (!hasNullFilter) {
+  return;
+}
+
+// If any projection is there from right side, then we can not convert to 
anti join.
+boolean hasProjection = 
HiveCalciteUtil.hasAnyExpressionFromRightSide(join, project.getProjects());
+if (hasProjection) {
+  return;
+}
+
+LOG.debug("Matched HiveAntiJoinRule");
+
+// Build anti join with same left, right child and condition as original 
left outer join.
+Join anti = HiveAntiJoin.getAntiJoin(join.getLeft().getCluster(), 
join.getLeft().getTraitSet(),
+join.getLeft(), join.getRight(), join.getCondition());
+RelNode newProject = project.copy(project.getTraitSet(), anti, 
project.getProjects(), project.getRowType());
+call.transformTo(newProject);

Review comment:

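The two guards in the rule quoted above both call HiveCalciteUtil.hasAnyExpressionFromRightSide, whose body is not quoted in this thread. A minimal sketch of what such a check could look like, assuming the usual Calcite convention that the joined row type lays out all left-input fields before the right-input fields; the class and method below are illustrative stand-ins, not the actual Hive utility:

{code:java}
import java.util.List;

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.core.Join;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.util.ImmutableBitSet;

public final class RightSideReferenceCheck {

  private RightSideReferenceCheck() {
  }

  /**
   * Returns true if any of the given expressions refers to a field of the
   * join's right input. Left-input fields occupy indices
   * [0, leftFieldCount) of the joined row type.
   */
  public static boolean hasAnyExpressionFromRightSide(Join join, List<RexNode> exprs) {
    int leftFieldCount = join.getLeft().getRowType().getFieldCount();
    for (RexNode expr : exprs) {
      ImmutableBitSet usedFields = RelOptUtil.InputFinder.bits(expr);
      for (int fieldIndex : usedFields) {
        if (fieldIndex >= leftFieldCount) {
          return true;
        }
      }
    }
    return false;
  }
}
{code}

Under that assumption, an IS NULL conjunct or a projected expression "comes from the right side" exactly when one of its input references lands at or beyond the left field count, which is what the two early returns in the rule test for.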
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=466570&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-466570
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 05/Aug/20 03:13
Start Date: 05/Aug/20 03:13
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r465446495



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -92,7 +104,7 @@ public void onMatch(RelOptRuleCall call) {
 Set<String> rightPushedPredicates = Sets.newHashSet(registry.getPushedPredicates(join, 1));
 
 boolean genPredOnLeft = join.getJoinType() == JoinRelType.RIGHT || 
join.getJoinType() == JoinRelType.INNER || join.isSemiJoin();
-boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || 
join.getJoinType() == JoinRelType.INNER || join.isSemiJoin();
+boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin() || join.getJoinType() == JoinRelType.ANTI;

Review comment:
   I was referring to empty input (no rows) rather than null.
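For context, the flag extended in the hunk above is a per-join-type decision. A standalone sketch of that decision, assuming Calcite's JoinRelType values and the isSemiJoin() flag used in the patch (illustrative only, not the Hive rule itself):

{code:java}
import org.apache.calcite.rel.core.JoinRelType;

public final class NotNullPredicateDecision {

  private NotNullPredicateDecision() {
  }

  /**
   * Mirrors the genPredOnRight flag: an IS NOT NULL predicate on the right
   * input's join keys may be generated when a right-side row with a NULL key
   * can never influence the result. For ANTI, such a row can never satisfy
   * the equi-join condition, so dropping it early should not change which
   * left rows are emitted.
   */
  public static boolean generatePredicateOnRight(JoinRelType joinType, boolean isSemiJoin) {
    return joinType == JoinRelType.LEFT
        || joinType == JoinRelType.INNER
        || isSemiJoin
        || joinType == JoinRelType.ANTI;
  }
}
{code}

The empty-input case raised here is a separate question about the right side producing no rows at all; a later reply in this thread points at the execution-side producedRow check that handles it.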





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 466570)
Time Spent: 14.5h  (was: 14h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 14.5h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467702&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467702
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 04:46
Start Date: 07/Aug/20 04:46
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466818194



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  
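The operator quoted above probes a hash set that holds only the small-table join keys. A standalone sketch of that probe with plain Java collections and a single long key column, rather than Hive's vectorized classes:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class LongKeyAntiJoinSketch {

  private LongKeyAntiJoinSketch() {
  }

  /**
   * Emits every big-table key that has no match in the small-table key set.
   * Only key membership matters for an anti join, so a Set is enough; no
   * key-to-row hash map and none of the small table's non-key columns are
   * needed.
   */
  public static List<Long> antiJoin(List<Long> bigTableKeys, Set<Long> smallTableKeySet) {
    List<Long> result = new ArrayList<>();
    for (Long key : bigTableKeys) {
      if (!smallTableKeySet.contains(key)) {
        result.add(key);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Set<Long> smallTableKeySet = new HashSet<>(Arrays.asList(2L, 4L));
    // Keys 1, 3 and 5 have no match in the small table, so they survive the anti join.
    System.out.println(antiJoin(Arrays.asList(1L, 2L, 3L, 4L, 5L), smallTableKeySet));
  }
}
{code}

This is also the memory argument from the issue description: the anti join needs a key set rather than a full hash table.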

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467703&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467703
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 04:47
Start Date: 07/Aug/20 04:47
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466818358



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinLongOperator.java
##
@@ -0,0 +1,315 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinLongHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// TODO : Duplicate codes need to merge with semi join.
+// Single-Column Long hash table import.
+// Single-Column Long specific imports.
+
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on a 
Single-Column Long
+ * using a hash set.
+ */
+public class VectorMapJoinAntiJoinLongOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinLongOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinLongHashSet hashSet;
+
+  // Single-Column Long specific members.
+  // For integers, we have optional min/max filtering.
+  private transient boolean useMinMax;
+  private transient long min;
+  private transient long max;
+
+  // The column number for this one column join specialization.
+  private transient int singleJoinColumn;
+
+  // Pass-thru constructors.
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinLongOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinLongOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  // Process Single-Column Long Anti Join on a vectorized row batch.
+  @Override
+  protected void commonSetup() throws HiveException {
+super.commonSetup();
+
+// Initialize Single-Column Long members for this specialized class.
+singleJoinColumn = bigTableKeyColumnMap[0];
+  }
+
+  @Override
+  public void hashTableSetup() throws HiveException {
+super.hashTableSetup();
+
+// Get our Single-Column Long hash set information for this specialized 
class.
+hashSet = (VectorMapJoinLongHashSet) vectorMapJoinHashTable;
+useMinMax = hashSet.useMinMax();
+if (useMinMax) {
+  min = hashSet.min();
+  max = hashSet.max();
+}
+  }
+
+  @Override
+  public void processBatch(VectorizedRowBatch batch) throws HiveException {
+
+try {
+  // (Currently none)
+  // antiPerBatchSetup(batch);
+
+  // For anti joins, we may apply the filter(s) now.
+  for(VectorExpression ve : bigTableFilterExpressions) {
+ve.evaluate(batch);
+  
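Relating to the useMinMax, min and max members in the class above: for long keys the hash set can report the smallest and largest key it holds, and a probe key outside that range is guaranteed to have no match. A plain-Java sketch of how that short-circuit interacts with anti join semantics (illustrative, not the vectorized code path):

{code:java}
import java.util.Set;

public final class MinMaxAntiJoinProbe {

  private MinMaxAntiJoinProbe() {
  }

  /**
   * Returns true if the big-table row with this key should be emitted by the
   * anti join. A key outside [min, max] cannot be in the small-table key set,
   * so the row is emitted without touching the hash set at all.
   */
  public static boolean emitRow(long key, boolean useMinMax, long min, long max,
      Set<Long> smallTableKeySet) {
    if (useMinMax && (key < min || key > max)) {
      return true; // definitely no match, the anti join keeps the row
    }
    return !smallTableKeySet.contains(key);
  }
}
{code}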

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467704&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467704
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 04:48
Start Date: 07/Aug/20 04:48
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466818492



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/VectorMapJoinAntiJoinMultiKeyOperator.java
##
@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.exec.vector.mapjoin;
+
+import org.apache.hadoop.hive.ql.CompilationOpContext;
+import org.apache.hadoop.hive.ql.exec.JoinUtil;
+import org.apache.hadoop.hive.ql.exec.vector.VectorSerializeRow;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizationContext;
+import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
+import org.apache.hadoop.hive.ql.exec.vector.expressions.VectorExpression;
+import 
org.apache.hadoop.hive.ql.exec.vector.mapjoin.hashtable.VectorMapJoinBytesHashSet;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.plan.OperatorDesc;
+import org.apache.hadoop.hive.ql.plan.VectorDesc;
+import org.apache.hadoop.hive.serde2.ByteStream.Output;
+import 
org.apache.hadoop.hive.serde2.binarysortable.fast.BinarySortableSerializeWrite;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+
+// Multi-Key hash table import.
+// Multi-Key specific imports.
+
+// TODO : Duplicate codes need to merge with semi join.
+/*
+ * Specialized class for doing a vectorized map join that is an anti join on 
Multi-Key
+ * using hash set.
+ */
+public class VectorMapJoinAntiJoinMultiKeyOperator extends 
VectorMapJoinAntiJoinGenerateResultOperator {
+
+  private static final long serialVersionUID = 1L;
+
+  
//
+
+  private static final String CLASS_NAME = 
VectorMapJoinAntiJoinMultiKeyOperator.class.getName();
+  private static final Logger LOG = LoggerFactory.getLogger(CLASS_NAME);
+
+  protected String getLoggingPrefix() {
+return super.getLoggingPrefix(CLASS_NAME);
+  }
+
+  
//
+
+  // (none)
+
+  // The above members are initialized by the constructor and must not be
+  // transient.
+  //---
+
+  // The hash map for this specialized class.
+  private transient VectorMapJoinBytesHashSet hashSet;
+
+  //---
+  // Multi-Key specific members.
+  //
+
+  // Object that can take a set of columns in row in a vectorized row batch 
and serialized it.
+  // Known to not have any nulls.
+  private transient VectorSerializeRow<BinarySortableSerializeWrite> keyVectorSerializeWrite;
+
+  // The BinarySortable serialization of the current key.
+  private transient Output currentKeyOutput;
+
+  // The BinarySortable serialization of the saved key for a possible series 
of equal keys.
+  private transient Output saveKeyOutput;
+
+  //---
+  // Pass-thru constructors.
+  //
+
+  /** Kryo ctor. */
+  protected VectorMapJoinAntiJoinMultiKeyOperator() {
+super();
+  }
+
+  public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx) {
+super(ctx);
+  }
+
+  public VectorMapJoinAntiJoinMultiKeyOperator(CompilationOpContext ctx, 
OperatorDesc conf,
+   VectorizationContext vContext, 
VectorDesc vectorDesc) throws HiveException {
+super(ctx, conf, vContext, vectorDesc);
+  }
+
+  //---
+  // Process Multi-Key Anti Join on a vectorized row batch.
+  //
+
+  @Override
+  protected void commonSetup() thro
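Unlike the single-column long case, the multi-key operator above serializes all key columns of a row into one byte sequence (through BinarySortableSerializeWrite) and probes a bytes hash set. A rough standalone sketch of the same idea, using ByteBuffer as a stand-in for Hive's serialization and hash set classes and hypothetical key columns:

{code:java}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public final class MultiKeyAntiJoinSketch {

  private MultiKeyAntiJoinSketch() {
  }

  /** Serializes the key columns of one row into a single byte-comparable key. */
  public static ByteBuffer serializeKey(long keyCol1, String keyCol2) {
    byte[] strBytes = keyCol2.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + strBytes.length);
    buf.putLong(keyCol1);
    buf.put(strBytes);
    buf.flip();
    return buf;
  }

  public static void main(String[] args) {
    Set<ByteBuffer> smallTableKeys = new HashSet<>();
    smallTableKeys.add(serializeKey(1L, "a"));

    // (1, "a") matches and is dropped; (1, "b") has no match and is emitted.
    System.out.println(!smallTableKeys.contains(serializeKey(1L, "a"))); // false -> dropped
    System.out.println(!smallTableKeys.contains(serializeKey(1L, "b"))); // true  -> emitted
  }
}
{code}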

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467706&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467706
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 04:50
Start Date: 07/Aug/20 04:50
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466819149



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveAntiSemiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveAntiSemiJoinRule.class);
+  public static final HiveAntiSemiJoinRule INSTANCE = new 
HiveAntiSemiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveAntiSemiJoinRule() {
+super(operand(Project.class, operand(Filter.class, operand(Join.class, RelOptRule.any()))),
+"HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+final Project project = call.rel(0);
+final Filter filter = call.rel(1);
+final Join join = call.rel(2);
+perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, 
Join join) {
+LOG.debug("Start Matching HiveAntiJoinRule");
+
+//TODO : Need to support this scenario.
+if (join.getCondition().isAlwaysTrue()) {
+  return;
+}
+
+//We support conversion from left outer join only.
+if (join.getJoinType() != JoinRelType.LEFT) {
+  return;
+}
+
+assert (filter != null);
+
+// If null filter is not present from right side then we can not convert 
to anti join.
+List<RexNode> aboveFilters = RelOptUtil.conjunctions(filter.getCondition());
+Stream<RexNode> nullFilters = aboveFilters.stream().filter(filterNode -> filterNode.getKind() == SqlKind.IS_NULL);
+boolean hasNullFilter = HiveCalciteUtil.hasAnyExpressionFromRightSide(join, nullFilters.collect(Collectors.toList()));
+if (!hasNullFilter) {
+  return;
+}
+
+// If any projection is there from right side, then we can not convert to 
anti join.
+boolean hasProjection = 
HiveCalciteUtil.hasAnyExpressionFromRightSide(join, project.getProjects());
+if (hasProjection) {
+  return;
+}
+
+LOG.debug("Matched HiveAntiJoinRule");
+
+// Build anti join with same left, right child and condition as original 
left outer join.
+Join anti = HiveAntiJoin.getAntiJoin(join.getLeft().getCluster(), 
join.getLeft().getTraitSet(),
+join.getLeft(), join.getRight(), join.getCondition());
+RelNode newProject = project.copy(project.getTraitSet(), anti, 
project.getProjects(), project.getRowType());
+call.transformTo(newProject);

Review comment

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467708&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467708
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 04:52
Start Date: 07/Aug/20 04:52
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466819572



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinAddNotNullRule.java
##
@@ -92,7 +104,7 @@ public void onMatch(RelOptRuleCall call) {
 Set<String> rightPushedPredicates = Sets.newHashSet(registry.getPushedPredicates(join, 1));
 
 boolean genPredOnLeft = join.getJoinType() == JoinRelType.RIGHT || 
join.getJoinType() == JoinRelType.INNER || join.isSemiJoin();
-boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || 
join.getJoinType() == JoinRelType.INNER || join.isSemiJoin();
+boolean genPredOnRight = join.getJoinType() == JoinRelType.LEFT || join.getJoinType() == JoinRelType.INNER || join.isSemiJoin() || join.getJoinType() == JoinRelType.ANTI;

Review comment:
   Yes, that is taken care of:
   // For anti join, we should proceed to emit records if the right side is empty or not matching.
   if (type == JoinDesc.ANTI_JOIN && !producedRow) {
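A standalone sketch of that emit-on-no-match behaviour for a merge-style join over sorted inputs, in plain Java rather than the Hive operator code; with an empty right input, producedRow is never true, so every left row is emitted:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class SortedAntiJoinSketch {

  private SortedAntiJoinSketch() {
  }

  /** Both inputs sorted ascending; emits left keys that have no equal right key. */
  public static List<Long> antiJoin(List<Long> leftKeys, List<Long> rightKeys) {
    List<Long> out = new ArrayList<>();
    int r = 0;
    for (long leftKey : leftKeys) {
      while (r < rightKeys.size() && rightKeys.get(r) < leftKey) {
        r++;
      }
      boolean producedRow = r < rightKeys.size() && rightKeys.get(r) == leftKey;
      if (!producedRow) {
        out.add(leftKey); // right side empty or not matching: emit the left row
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Empty right side: every left key is emitted.
    System.out.println(antiJoin(Arrays.asList(1L, 2L, 3L), Arrays.<Long>asList()));
    // Partial matches: only the unmatched keys survive.
    System.out.println(antiJoin(Arrays.asList(1L, 2L, 3L), Arrays.asList(2L)));
  }
}
{code}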





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 467708)
Time Spent: 15h 20m  (was: 15h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 15h 20m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467709&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467709
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 05:04
Start Date: 07/Aug/20 05:04
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466822516



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##
@@ -2129,6 +2133,16 @@ private RelNode applyPreJoinOrderingTransforms(RelNode 
basePlan, RelMetadataProv
 HiveRemoveSqCountCheck.INSTANCE);
   }
 
+  // 10. Convert left outer join + null filter on right side table column 
to anti join. Add this
+  // rule after all the optimization for which calcite support for anti 
join is missing.
+  // Needs to be done before ProjectRemoveRule as it expect a project over 
filter.
+  // This is done before join re-ordering as join re-ordering is 
converting the left outer

Review comment:
   As discussed, I have created a Jira for this: https://issues.apache.org/jira/browse/HIVE-24013
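On the ordering point in the quoted comment: in a Hep-style planner, rules fire in the order they are added to the program, so the anti join conversion has to be registered before project removal and join re-ordering if it is to still see the project over filter over left-join shape. A minimal Calcite sketch of that wiring (illustrative only; the surrounding Hive rules and the actual CalcitePlanner plumbing are elided):

{code:java}
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveAntiSemiJoinRule;

public final class AntiJoinRuleOrderingSketch {

  private AntiJoinRuleOrderingSketch() {
  }

  /**
   * Applies the anti join conversion before any later rule (project removal,
   * join re-ordering, ...) that would destroy the
   * project -> filter -> left-join pattern the rule matches on.
   */
  public static RelNode applyAntiJoinConversion(RelNode basePlan) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(HiveAntiSemiJoinRule.INSTANCE)
        // ... later rules such as project removal and join re-ordering go here ...
        .build();
    HepPlanner planner = new HepPlanner(program);
    planner.setRoot(basePlan);
    return planner.findBestExp();
  }
}
{code}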





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 467709)
Time Spent: 15.5h  (was: 15h 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 15.5h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467710&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467710
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 05:10
Start Date: 07/Aug/20 05:10
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466823815



##
File path: ql/src/test/results/clientpositive/llap/antijoin.q.out
##
@@ -0,0 +1,1007 @@
+PREHOOK: query: create table t1_n55 as select cast(key as int) key, value from 
src where key <= 10
+PREHOOK: type: CREATETABLE_AS_SELECT
+PREHOOK: Input: default@src
+PREHOOK: Output: database:default
+PREHOOK: Output: default@t1_n55
+POSTHOOK: query: create table t1_n55 as select cast(key as int) key, value 
from src where key <= 10
+POSTHOOK: type: CREATETABLE_AS_SELECT
+POSTHOOK: Input: default@src
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@t1_n55
+POSTHOOK: Lineage: t1_n55.key EXPRESSION [(src)src.FieldSchema(name:key, 
type:string, comment:default), ]
+POSTHOOK: Lineage: t1_n55.value SIMPLE [(src)src.FieldSchema(name:value, 
type:string, comment:default), ]
+PREHOOK: query: select * from t1_n55 sort by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@t1_n55
+ A masked pattern was here 
+POSTHOOK: query: select * from t1_n55 sort by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@t1_n55
+ A masked pattern was here 
+0  val_0

Review comment:
   Now the anti join conversion is enabled (true) by default.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 467710)
Time Spent: 15h 40m  (was: 15.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 15h 40m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467711&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467711
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 05:17
Start Date: 07/Aug/20 05:17
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466825617



##
File path: 
ql/src/test/results/clientpositive/llap/subquery_notexists_having.q.out
##
@@ -31,7 +31,8 @@ STAGE PLANS:
 Tez
  A masked pattern was here 
   Edges:
-Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
+Reducer 2 <- Map 1 (SIMPLE_EDGE)

Review comment:
   Yes, the join is getting converted to an SMB join, so no reducer is required. In the anti join case it is not converted: the left outer join plan adds an extra group by, which makes the RS nodes on the left and right sides equal, and that is the pre-condition for converting to an SMB join.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 467711)
Time Spent: 15h 50m  (was: 15h 40m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 15h 50m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467713
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 05:19
Start Date: 07/Aug/20 05:19
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466826103



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set 
llapDaemonVarsSetLocal
 "Whether Hive enables the optimization about converting common join 
into mapjoin based on the input file size. \n" +
 "If this parameter is on, and the sum of size for n-1 of the 
tables/partitions for a n-way join is smaller than the\n" +
 "specified size, the join is directly converted to a mapjoin (there is 
no conditional task)."),
-
+HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment:
   done
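For completeness, a small sketch of how the new flag could be toggled and read through HiveConf, assuming the HIVE_CONVERT_ANTI_JOIN ConfVars entry shown in the hunk above:

{code:java}
import org.apache.hadoop.hive.conf.HiveConf;

public final class AntiJoinConfExample {

  public static void main(String[] args) {
    // HIVE_CONVERT_ANTI_JOIN is the ConfVars constant added by this patch.
    HiveConf conf = new HiveConf();
    conf.setBoolVar(HiveConf.ConfVars.HIVE_CONVERT_ANTI_JOIN, true);
    boolean antiJoinEnabled = conf.getBoolVar(HiveConf.ConfVars.HIVE_CONVERT_ANTI_JOIN);
    System.out.println("hive.auto.convert.anti.join = " + antiJoinEnabled);
  }
}
{code}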





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 467713)
Time Spent: 16h  (was: 15h 50m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 16h
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467716
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 05:21
Start Date: 07/Aug/20 05:21
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466826675



##
File path: 
ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out
##
@@ -0,0 +1,99 @@
+PREHOOK: query: explain cbo

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 467716)
Time Spent: 16h 10m  (was: 16h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 16h 10m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. In case of left join, we need the key and the non key columns 
> also and thus a hash table will be required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done. The pattern of 
> project->filter->left-join is converted to project->anti-join. This will take 
> care of sub queries with “not exists” clause. The queries with “not exists” 
> are converted first to filter + left-join and then its converted to anti 
> join. The queries with “not in” are not handled in the current patch.
> From execution side, both merge join and map join with vectorized execution  
> is supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467717&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467717
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 05:22
Start Date: 07/Aug/20 05:22
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466826953



##
File path: 
ql/src/test/results/clientpositive/perf/tez/constraints/cbo_query94_anti_join.q.out
##
@@ -0,0 +1,94 @@
+PREHOOK: query: explain cbo
+select  
+   count(distinct ws_order_number) as `order count`
+  ,sum(ws_ext_ship_cost) as `total shipping cost`
+  ,sum(ws_net_profit) as `total net profit`
+from
+   web_sales ws1
+  ,date_dim
+  ,customer_address
+  ,web_site
+where
+d_date between '1999-5-01' and 
+   (cast('1999-5-01' as date) + 60 days)
+and ws1.ws_ship_date_sk = d_date_sk
+and ws1.ws_ship_addr_sk = ca_address_sk
+and ca_state = 'TX'
+and ws1.ws_web_site_sk = web_site_sk
+and web_company_name = 'pri'
+and exists (select *
+from web_sales ws2
+where ws1.ws_order_number = ws2.ws_order_number
+  and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
+and not exists(select *
+   from web_returns wr1
+   where ws1.ws_order_number = wr1.wr_order_number)
+order by count(distinct ws_order_number)
+limit 100
+PREHOOK: type: QUERY
+PREHOOK: Input: default@customer_address
+PREHOOK: Input: default@date_dim
+PREHOOK: Input: default@web_returns
+PREHOOK: Input: default@web_sales
+PREHOOK: Input: default@web_site
+PREHOOK: Output: hdfs://### HDFS PATH ###
+POSTHOOK: query: explain cbo
+select  
+   count(distinct ws_order_number) as `order count`
+  ,sum(ws_ext_ship_cost) as `total shipping cost`
+  ,sum(ws_net_profit) as `total net profit`
+from
+   web_sales ws1
+  ,date_dim
+  ,customer_address
+  ,web_site
+where
+d_date between '1999-5-01' and 
+   (cast('1999-5-01' as date) + 60 days)
+and ws1.ws_ship_date_sk = d_date_sk
+and ws1.ws_ship_addr_sk = ca_address_sk
+and ca_state = 'TX'
+and ws1.ws_web_site_sk = web_site_sk
+and web_company_name = 'pri'
+and exists (select *
+from web_sales ws2
+where ws1.ws_order_number = ws2.ws_order_number
+  and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
+and not exists(select *
+   from web_returns wr1
+   where ws1.ws_order_number = wr1.wr_order_number)
+order by count(distinct ws_order_number)
+limit 100
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@customer_address
+POSTHOOK: Input: default@date_dim
+POSTHOOK: Input: default@web_returns
+POSTHOOK: Input: default@web_sales
+POSTHOOK: Input: default@web_site
+POSTHOOK: Output: hdfs://### HDFS PATH ###
+CBO PLAN:
+HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], 
agg#2=[sum($6)])
+  HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], 
cost=[not available])

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 467717)
Time Spent: 16h 20m  (was: 16h 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 16h 20m
>  Remaining Estimate: 0h
>
> Currently hive does not support Anti join. The query for anti join is 
> converted to left outer join and null filter on right side join key is added 
> to get the desired result. This is causing
>  # Extra computation — The left outer join projects the redundant columns 
> from right side. Along with that, filtering is done to remove the redundant 
> rows. This is can be avoided in case of anti join as anti join will project 
> only the required columns and rows from the left side table.
>  # Extra shuffle — In case of anti join the duplicate records moved to join 
> node can be avoided from the child node. This can reduce significant amount 
> of data movement if the number of distinct rows( join keys) is significant.
>  # Extra Memory Usage - In case of map based anti join , hash set is 
> sufficient as just the key is required to check  if the records matches the 
> join condition. 

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=467800&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467800
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 07/Aug/20 10:35
Start Date: 07/Aug/20 10:35
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r466819149



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAntiSemiJoinRule.java
##
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.core.Filter;
+import org.apache.calcite.rel.core.Join;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveCalciteUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAntiJoin;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * Planner rule that converts a join plus filter to anti join.
+ */
+public class HiveAntiSemiJoinRule extends RelOptRule {
+  protected static final Logger LOG = 
LoggerFactory.getLogger(HiveAntiSemiJoinRule.class);
+  public static final HiveAntiSemiJoinRule INSTANCE = new 
HiveAntiSemiJoinRule();
+
+  //HiveProject(fld=[$0])
+  //  HiveFilter(condition=[IS NULL($1)])
+  //HiveJoin(condition=[=($0, $1)], joinType=[left], algorithm=[none], 
cost=[not available])
+  //
+  // TO
+  //
+  //HiveProject(fld_tbl=[$0])
+  //  HiveAntiJoin(condition=[=($0, $1)], joinType=[anti])
+  //
+  public HiveAntiSemiJoinRule() {
+super(operand(Project.class, operand(Filter.class, operand(Join.class, 
RelOptRule.any()))),
+"HiveJoinWithFilterToAntiJoinRule:filter");
+  }
+
+  // is null filter over a left join.
+  public void onMatch(final RelOptRuleCall call) {
+final Project project = call.rel(0);
+final Filter filter = call.rel(1);
+final Join join = call.rel(2);
+perform(call, project, filter, join);
+  }
+
+  protected void perform(RelOptRuleCall call, Project project, Filter filter, 
Join join) {
+LOG.debug("Start Matching HiveAntiJoinRule");
+
+//TODO : Need to support this scenario.
+if (join.getCondition().isAlwaysTrue()) {
+  return;
+}
+
+//We support conversion from left outer join only.
+if (join.getJoinType() != JoinRelType.LEFT) {
+  return;
+}
+
+assert (filter != null);
+
+// If null filter is not present from right side then we can not convert 
to anti join.
+List<RexNode> aboveFilters = 
RelOptUtil.conjunctions(filter.getCondition());
+Stream<RexNode> nullFilters = aboveFilters.stream().filter(filterNode -> 
filterNode.getKind() == SqlKind.IS_NULL);
+boolean hasNullFilter = 
HiveCalciteUtil.hasAnyExpressionFromRightSide(join, 
nullFilters.collect(Collectors.toList()));
+if (!hasNullFilter) {
+  return;
+}
+
+// If any projection is there from right side, then we can not convert to 
anti join.
+boolean hasProjection = 
HiveCalciteUtil.hasAnyExpressionFromRightSide(join, project.getProjects());
+if (hasProjection) {
+  return;
+}
+
+LOG.debug("Matched HiveAntiJoinRule");
+
+// Build anti join with same left, right child and condition as original 
left outer join.
+Join anti = HiveAntiJoin.getAntiJoin(join.getLeft().getCluster(), 
join.getLeft().getTraitSet(),
+join.getLeft(), join.getRight(), join.getCondition());
+RelNode newProject = project.copy(project.getTraitSet(), anti, 
project.getProjects(), project.getRowType());
+call.transformTo(newProject);

Review comment

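As a quick sanity check of what this conversion means, the two plan shapes in the comment above (left outer join plus IS NULL filter versus anti join) can be evaluated over in-memory data. This is a minimal, self-contained Java sketch, illustrative only and not Hive or Calcite code; all class and variable names are made up:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ConversionEquivalenceSketch {

  // Plan before conversion: LEFT OUTER JOIN on the key, then keep rows whose right key IS NULL.
  // The right side has to be materialized as a hash table from key to matching rows.
  static List<Integer> leftJoinThenNullFilter(List<Integer> leftKeys, List<String[]> rightRows) {
    Map<Integer, List<String[]>> rightByKey = new HashMap<>();
    for (String[] row : rightRows) {
      rightByKey.computeIfAbsent(Integer.valueOf(row[0]), k -> new ArrayList<>()).add(row);
    }
    List<Integer> result = new ArrayList<>();
    for (Integer key : leftKeys) {
      if (!rightByKey.containsKey(key)) {   // the outer join produced a NULL right key
        result.add(key);
      }
      // Matching rows would be joined (once per match) and then dropped again by the IS NULL filter.
    }
    return result;
  }

  // Plan after conversion: anti join; only a set of right-side keys is consulted.
  static List<Integer> antiJoin(List<Integer> leftKeys, List<String[]> rightRows) {
    Set<Integer> rightKeys = rightRows.stream()
        .map(row -> Integer.valueOf(row[0]))
        .collect(Collectors.toSet());
    return leftKeys.stream().filter(k -> !rightKeys.contains(k)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Integer> left = List.of(1, 2, 3, 4, 5);
    List<String[]> right = List.of(
        new String[]{"2", "x"}, new String[]{"2", "y"}, new String[]{"4", "z"});
    System.out.println(leftJoinThenNullFilter(left, right));  // [1, 3, 5]
    System.out.println(antiJoin(left, right));                // [1, 3, 5]
  }
}
{code}

Both forms return the same left-side keys; the anti join form simply never materializes the right-side payload columns.
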
[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=468138&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468138
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 08/Aug/20 01:19
Start Date: 08/Aug/20 01:19
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r467343664



##
File path: 
ql/src/test/results/clientpositive/perf/tez/cbo_query16_anti_join.q.out
##
@@ -0,0 +1,99 @@
+PREHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+d_date between '2001-4-01' and
+   (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin 
Parish',
+  'Daviess County'
+)
+and exists (select *
+from catalog_sales cs2
+where cs1.cs_order_number = cs2.cs_order_number
+  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+   from catalog_returns cr1
+   where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+PREHOOK: type: QUERY
+PREHOOK: Input: default@call_center
+PREHOOK: Input: default@catalog_returns
+PREHOOK: Input: default@catalog_sales
+PREHOOK: Input: default@customer_address
+PREHOOK: Input: default@date_dim
+PREHOOK: Output: hdfs://### HDFS PATH ###
+POSTHOOK: query: explain cbo
+select
+   count(distinct cs_order_number) as `order count`
+  ,sum(cs_ext_ship_cost) as `total shipping cost`
+  ,sum(cs_net_profit) as `total net profit`
+from
+   catalog_sales cs1
+  ,date_dim
+  ,customer_address
+  ,call_center
+where
+d_date between '2001-4-01' and
+   (cast('2001-4-01' as date) + 60 days)
+and cs1.cs_ship_date_sk = d_date_sk
+and cs1.cs_ship_addr_sk = ca_address_sk
+and ca_state = 'NY'
+and cs1.cs_call_center_sk = cc_call_center_sk
+and cc_county in ('Ziebach County','Levy County','Huron County','Franklin 
Parish',
+  'Daviess County'
+)
+and exists (select *
+from catalog_sales cs2
+where cs1.cs_order_number = cs2.cs_order_number
+  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
+and not exists(select *
+   from catalog_returns cr1
+   where cs1.cs_order_number = cr1.cr_order_number)
+order by count(distinct cs_order_number)
+limit 100
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@call_center
+POSTHOOK: Input: default@catalog_returns
+POSTHOOK: Input: default@catalog_sales
+POSTHOOK: Input: default@customer_address
+POSTHOOK: Input: default@date_dim
+POSTHOOK: Output: hdfs://### HDFS PATH ###
+CBO PLAN:
+HiveAggregate(group=[{}], agg#0=[count(DISTINCT $4)], agg#1=[sum($5)], 
agg#2=[sum($6)])
+  HiveJoin(condition=[=($4, $14)], joinType=[anti], algorithm=[none], 
cost=[not available])

Review comment:
   Do we have a JIRA to explore this optimization?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 468138)
Time Spent: 16h 40m  (was: 16.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 16h 40m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation — The left outer join projects redundant columns from the 
> right side, and additional filtering is needed to remove the redundant rows. 
> This can be avoided with an anti join, which projects only the required columns 
> and rows from the left-side table.
>  # Extra shuffle — With an anti join, duplicate records need not be moved from 
> the child node to the join node. This can reduce a significant amount of data 
> movement if the number of distinct rows (join keys) is significant.
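
The plan above combines a semi join (from the EXISTS subquery on catalog_sales with a different warehouse) with the new anti join (from the NOT EXISTS subquery on catalog_returns). The two predicates can be sketched in plain Java over in-memory rows; this is illustrative only (Java 16+ for records), with made-up names, and says nothing about how Hive executes the plan:

{code:java}
import java.util.List;
import java.util.Set;

public class Query16PredicateSketch {

  // A catalog_sales row reduced to the two columns the subqueries touch.
  record Sale(long orderNumber, long warehouseSk) {}

  // EXISTS: another sale with the same order number but a different warehouse (semi join side).
  static boolean shippedFromAnotherWarehouse(Sale cs1, List<Sale> allSales) {
    return allSales.stream().anyMatch(
        cs2 -> cs2.orderNumber() == cs1.orderNumber()
            && cs2.warehouseSk() != cs1.warehouseSk());
  }

  // NOT EXISTS: no catalog_returns row with the same order number (anti join side).
  static boolean notReturned(Sale cs1, Set<Long> returnedOrderNumbers) {
    return !returnedOrderNumbers.contains(cs1.orderNumber());
  }

  public static void main(String[] args) {
    List<Sale> sales = List.of(new Sale(1, 10), new Sale(1, 20), new Sale(2, 10), new Sale(3, 10));
    Set<Long> returnedOrders = Set.of(2L, 3L);

    sales.stream()
        .filter(s -> shippedFromAnotherWarehouse(s, sales))
        .filter(s -> notReturned(s, returnedOrders))
        .forEach(System.out::println);   // keeps only the two rows for order 1
  }
}
{code}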

[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=468139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468139
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 08/Aug/20 01:22
Start Date: 08/Aug/20 01:22
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1147:
URL: https://github.com/apache/hive/pull/1147


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 468139)
Time Spent: 16h 50m  (was: 16h 40m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 16h 50m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation — The left outer join projects redundant columns from the 
> right side, and additional filtering is needed to remove the redundant rows. 
> This can be avoided with an anti join, which projects only the required columns 
> and rows from the left-side table.
>  # Extra shuffle — With an anti join, duplicate records need not be moved from 
> the child node to the join node. This can reduce a significant amount of data 
> movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, as 
> only the key is needed to check whether a record matches the join condition. For 
> a left join, the key and the non-key columns are also needed, and thus a hash 
> table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes care 
> of subqueries with a “not exists” clause. Queries with “not exists” are first 
> converted to filter + left-join and then converted to anti join. Queries with 
> “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
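
A small illustration of the shuffle argument in the description above: for an anti join, a child task only needs to forward each distinct join key once, since the join merely checks key existence on the right side. The numbers below are made up; this is not Hive code:

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AntiJoinShuffleSketch {
  public static void main(String[] args) {
    // Hypothetical ws_order_number values produced by one child task (many duplicates).
    List<Long> wsOrderNumbers = List.of(10L, 10L, 10L, 11L, 12L, 12L, 10L, 13L, 11L);

    // For an anti join it is enough to forward each distinct key once,
    // because the join only needs to know whether a key exists on the right side.
    Set<Long> forwarded = new HashSet<>(wsOrderNumbers);

    System.out.println("rows produced by the child: " + wsOrderNumbers.size());  // 9
    System.out.println("rows that must be shuffled: " + forwarded.size());       // 4
  }
}
{code}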


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=447831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447831
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 18/Jun/20 13:57
Start Date: 18/Jun/20 13:57
Worklog Time Spent: 10m 
  Work Description: maheshk114 opened a new pull request #1147:
URL: https://github.com/apache/hive/pull/1147


   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HIVE-X: Fix a typo in YYY)
   For more details, please see 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 447831)
Remaining Estimate: 0h
Time Spent: 10m

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation — The left outer join projects redundant columns from the 
> right side, and additional filtering is needed to remove the redundant rows. 
> This can be avoided with an anti join, which projects only the required columns 
> and rows from the left-side table.
>  # Extra shuffle — With an anti join, duplicate records need not be moved from 
> the child node to the join node. This can reduce a significant amount of data 
> movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, as 
> only the key is needed to check whether a record matches the join condition. For 
> a left join, the key and the non-key columns are also needed, and thus a hash 
> table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes care 
> of subqueries with a “not exists” clause. Queries with “not exists” are first 
> converted to filter + left-join and then converted to anti join. Queries with 
> “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460414&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460414
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 17/Jul/20 17:51
Start Date: 17/Jul/20 17:51
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456588241



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set<String> 
llapDaemonVarsSetLocal
 "Whether Hive enables the optimization about converting common join 
into mapjoin based on the input file size. \n" +
 "If this parameter is on, and the sum of size for n-1 of the 
tables/partitions for a n-way join is smaller than the\n" +
 "specified size, the join is directly converted to a mapjoin (there is 
no conditional task)."),
-
+HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment:
   @maheshk114 Have you run all the tests with this feature set to true by 
default? This change touches existing logic/code and we should definitely run 
all the existing tests with this set to TRUE.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 460414)
Time Spent: 20m  (was: 10m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation — The left outer join projects redundant columns from the 
> right side, and additional filtering is needed to remove the redundant rows. 
> This can be avoided with an anti join, which projects only the required columns 
> and rows from the left-side table.
>  # Extra shuffle — With an anti join, duplicate records need not be moved from 
> the child node to the join node. This can reduce a significant amount of data 
> movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, as 
> only the key is needed to check whether a record matches the join condition. For 
> a left join, the key and the non-key columns are also needed, and thus a hash 
> table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes care 
> of subqueries with a “not exists” clause. Queries with “not exists” are first 
> converted to filter + left-join and then converted to anti join. Queries with 
> “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
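
For trying out the flag discussed in this review, a minimal JDBC sketch is shown below. The connection URL, credentials, and table names are placeholders, the Hive JDBC driver is assumed to be on the classpath, and the build is assumed to contain hive.auto.convert.anti.join:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AntiJoinExplainSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder HiveServer2 endpoint and credentials.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // Session-level override of the flag (false by default in the patch).
      stmt.execute("SET hive.auto.convert.anti.join=true");

      // A NOT EXISTS query that the optimizer can rewrite to an anti join.
      String explain = "EXPLAIN CBO "
          + "SELECT wr_order_number FROM web_returns wr "
          + "WHERE NOT EXISTS (SELECT 1 FROM web_sales ws "
          + "WHERE ws.ws_order_number = wr.wr_order_number)";

      try (ResultSet rs = stmt.executeQuery(explain)) {
        while (rs.next()) {
          // With the flag on, the printed plan should contain joinType=[anti].
          System.out.println(rs.getString(1));
        }
      }
    }
  }
}
{code}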


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460416&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460416
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 17/Jul/20 17:52
Start Date: 17/Jul/20 17:52
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456588923



##
File path: ql/src/test/results/clientpositive/llap/antijoin.q.out
##
@@ -0,0 +1,1007 @@
+PREHOOK: query: create table t1_n55 as select cast(key as int) key, value from 
src where key <= 10
+PREHOOK: type: CREATETABLE_AS_SELECT
+PREHOOK: Input: default@src
+PREHOOK: Output: database:default
+PREHOOK: Output: default@t1_n55
+POSTHOOK: query: create table t1_n55 as select cast(key as int) key, value 
from src where key <= 10
+POSTHOOK: type: CREATETABLE_AS_SELECT
+POSTHOOK: Input: default@src
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@t1_n55
+POSTHOOK: Lineage: t1_n55.key EXPRESSION [(src)src.FieldSchema(name:key, 
type:string, comment:default), ]
+POSTHOOK: Lineage: t1_n55.value SIMPLE [(src)src.FieldSchema(name:value, 
type:string, comment:default), ]
+PREHOOK: query: select * from t1_n55 sort by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@t1_n55
+ A masked pattern was here 
+POSTHOOK: query: select * from t1_n55 sort by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@t1_n55
+ A masked pattern was here 
+0  val_0

Review comment:
   How was the correctness of results verified?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 460416)
Time Spent: 0.5h  (was: 20m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation — The left outer join projects redundant columns from the 
> right side, and additional filtering is needed to remove the redundant rows. 
> This can be avoided with an anti join, which projects only the required columns 
> and rows from the left-side table.
>  # Extra shuffle — With an anti join, duplicate records need not be moved from 
> the child node to the join node. This can reduce a significant amount of data 
> movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, as 
> only the key is needed to check whether a record matches the join condition. For 
> a left join, the key and the non-key columns are also needed, and thus a hash 
> table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes care 
> of subqueries with a “not exists” clause. Queries with “not exists” are first 
> converted to filter + left-join and then converted to anti join. Queries with 
> “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460427&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460427
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 17/Jul/20 18:01
Start Date: 17/Jul/20 18:01
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456593908



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -2162,7 +2162,8 @@ private static void populateLlapDaemonVarsSet(Set<String> 
llapDaemonVarsSetLocal
 "Whether Hive enables the optimization about converting common join 
into mapjoin based on the input file size. \n" +
 "If this parameter is on, and the sum of size for n-1 of the 
tables/partitions for a n-way join is smaller than the\n" +
 "specified size, the join is directly converted to a mapjoin (there is 
no conditional task)."),
-
+HIVE_CONVERT_ANTI_JOIN("hive.auto.convert.anti.join", false,

Review comment:
   Yes, I triggered a ptest run with this config enabled (set to true) by 
default. There were some 26 failures. I analyzed those, and some fixes were 
made to ensure that the results are the same in both cases and that the plan 
differences are as expected. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 460427)
Time Spent: 40m  (was: 0.5h)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation — The left outer join projects redundant columns from the 
> right side, and additional filtering is needed to remove the redundant rows. 
> This can be avoided with an anti join, which projects only the required columns 
> and rows from the left-side table.
>  # Extra shuffle — With an anti join, duplicate records need not be moved from 
> the child node to the join node. This can reduce a significant amount of data 
> movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, as 
> only the key is needed to check whether a record matches the join condition. For 
> a left join, the key and the non-key columns are also needed, and thus a hash 
> table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes care 
> of subqueries with a “not exists” clause. Queries with “not exists” are first 
> converted to filter + left-join and then converted to anti join. Queries with 
> “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23716) Support Anti Join in Hive

2020-07-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23716?focusedWorklogId=460429&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460429
 ]

ASF GitHub Bot logged work on HIVE-23716:
-

Author: ASF GitHub Bot
Created on: 17/Jul/20 18:02
Start Date: 17/Jul/20 18:02
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #1147:
URL: https://github.com/apache/hive/pull/1147#discussion_r456594608



##
File path: ql/src/test/results/clientpositive/llap/antijoin.q.out
##
@@ -0,0 +1,1007 @@
+PREHOOK: query: create table t1_n55 as select cast(key as int) key, value from 
src where key <= 10
+PREHOOK: type: CREATETABLE_AS_SELECT
+PREHOOK: Input: default@src
+PREHOOK: Output: database:default
+PREHOOK: Output: default@t1_n55
+POSTHOOK: query: create table t1_n55 as select cast(key as int) key, value 
from src where key <= 10
+POSTHOOK: type: CREATETABLE_AS_SELECT
+POSTHOOK: Input: default@src
+POSTHOOK: Output: database:default
+POSTHOOK: Output: default@t1_n55
+POSTHOOK: Lineage: t1_n55.key EXPRESSION [(src)src.FieldSchema(name:key, 
type:string, comment:default), ]
+POSTHOOK: Lineage: t1_n55.value SIMPLE [(src)src.FieldSchema(name:value, 
type:string, comment:default), ]
+PREHOOK: query: select * from t1_n55 sort by key
+PREHOOK: type: QUERY
+PREHOOK: Input: default@t1_n55
+ A masked pattern was here 
+POSTHOOK: query: select * from t1_n55 sort by key
+POSTHOOK: type: QUERY
+POSTHOOK: Input: default@t1_n55
+ A masked pattern was here 
+0  val_0

Review comment:
   All of these new test cases were added from the failing test cases of a dry 
run with anti join enabled (set to true). I have manually verified that the 
resulting records are the same and that the plan differences match the expected 
behavior. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 460429)
Time Spent: 50m  (was: 40m)

> Support Anti Join in Hive 
> --
>
> Key: HIVE-23716
> URL: https://issues.apache.org/jira/browse/HIVE-23716
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23716.01.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently Hive does not support anti join. A query that requires an anti join is 
> converted to a left outer join, and a null filter on the right-side join key is 
> added to get the desired result. This causes:
>  # Extra computation — The left outer join projects redundant columns from the 
> right side, and additional filtering is needed to remove the redundant rows. 
> This can be avoided with an anti join, which projects only the required columns 
> and rows from the left-side table.
>  # Extra shuffle — With an anti join, duplicate records need not be moved from 
> the child node to the join node. This can reduce a significant amount of data 
> movement if the number of distinct rows (join keys) is significant.
>  # Extra memory usage - For a map-based anti join, a hash set is sufficient, as 
> only the key is needed to check whether a record matches the join condition. For 
> a left join, the key and the non-key columns are also needed, and thus a hash 
> table is required.
> For a query like
> {code:java}
>  select wr_order_number FROM web_returns LEFT JOIN web_sales  ON 
> wr_order_number = ws_order_number WHERE ws_order_number IS NULL;{code}
> The number of distinct ws_order_number in web_sales table in a typical 10TB 
> TPCDS set up is just 10% of total records. So when we convert this query to 
> anti join, instead of 7 billion rows, only 600 million rows are moved to join 
> node.
> In the current patch, just one conversion is done: the pattern 
> project->filter->left-join is converted to project->anti-join. This takes care 
> of subqueries with a “not exists” clause. Queries with “not exists” are first 
> converted to filter + left-join and then converted to anti join. Queries with 
> “not in” are not handled in the current patch.
> On the execution side, both merge join and map join with vectorized execution 
> are supported for anti join.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
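
The kind of check described above (same result rows with and without the rewrite) can also be scripted. A minimal JDBC sketch along those lines follows; endpoint, credentials, and tables are placeholders, and this is not the ptest/qtest machinery itself:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class AntiJoinResultCheckSketch {

  // Runs a single-column query and returns its values, sorted for order-insensitive comparison.
  static List<String> run(Statement stmt, String sql) throws Exception {
    List<String> rows = new ArrayList<>();
    try (ResultSet rs = stmt.executeQuery(sql)) {
      while (rs.next()) {
        rows.add(String.valueOf(rs.getObject(1)));
      }
    }
    Collections.sort(rows);
    return rows;
  }

  public static void main(String[] args) throws Exception {
    String sql = "SELECT wr_order_number FROM web_returns LEFT JOIN web_sales "
        + "ON wr_order_number = ws_order_number WHERE ws_order_number IS NULL";

    // Placeholder HiveServer2 endpoint and credentials.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      stmt.execute("SET hive.auto.convert.anti.join=false");
      List<String> withoutRewrite = run(stmt, sql);

      stmt.execute("SET hive.auto.convert.anti.join=true");
      List<String> withRewrite = run(stmt, sql);

      System.out.println("results match: " + withoutRewrite.equals(withRewrite));
    }
  }
}
{code}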

