[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475964
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 20:13
Start Date: 28/Aug/20 20:13
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r479508618



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);

Review comment:
   Can we change `rowCount` to `outputRowCount`? This will make the change 
more readable.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();
   // 3. IO cost = cost of writing intermediary results to local FS +
   //  cost of reading from local FS for transferring to GBy +
   //  cost of transferring map outputs to GBy operator
   final Double rAverageSize = mq.getAverageRowSize(aggregate.getInput());
   if (rAverageSize == null) {
 return null;
   }
-  final double ioCost = algoUtils.computeSortIOCost(new Pair<Double, Double>(rCount, rAverageSize));
+  final double ioCost = algoUtils.computeSortIOCost(new Pair<Double, Double>(rowCount, rAverageSize));

Review comment:
`rAverageSize` is computed over the aggregate's input, but `rowCount` is the output row count. Is this intended, or should the average row size be computed over the output instead?
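For reference, a minimal sketch of what sizing by the aggregate's output could look like, assuming Calcite's `RelMetadataQuery#getAverageRowSize` (a suggestion, not the PR's code):

{code:java}
// Hypothetical: use the average row size of the aggregate's output rows
// instead of the input's when costing the output-side IO.
final Double outputAverageSize = mq.getAverageRowSize(aggregate);
if (outputAverageSize == null) {
  return null;
}
final double ioCost = algoUtils.computeSortIOCost(new Pair<Double, Double>(rowCount, outputAverageSize));
{code}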

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);

Review comment:
If the purpose of this method is to determine whether a given set of columns is unique, you can use `areColumnsUnique` as @jcamachor suggested.
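A minimal sketch of the simplest form of that suggestion, assuming Calcite's `RelMetadataQuery#areColumnsUnique` (which may return null when uniqueness cannot be determined):

{code:java}
private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
  if (groups.isEmpty()) {
    return false;
  }
  RelMetadataQuery mq = input.getCluster().getMetadataQuery();
  // null means the metadata provider could not decide; treat as not unique
  Boolean unique = mq.areColumnsUnique(input, groups);
  return unique != null && unique;
}
{code}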





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475964)
Time Spent: 0.5h  (was: 20m)

> Enhance cost model to push down more Aggregates
> ---
>
> Key: HIVE-24084
> URL: https://issues.apache.org/jira/browse/HIVE-24084
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24081) Enable pre-materializing CTEs referenced in scalar subqueries

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24081?focusedWorklogId=475962&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475962
 ]

ASF GitHub Bot logged work on HIVE-24081:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 19:47
Start Date: 28/Aug/20 19:47
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1437:
URL: https://github.com/apache/hive/pull/1437#discussion_r479498296



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -2591,6 +2591,10 @@ private static void 
populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
 HIVE_CTE_MATERIALIZE_THRESHOLD("hive.optimize.cte.materialize.threshold", 
-1,

Review comment:
   Should we set the default to `2` now that it is only triggered in very 
specific cases?

##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java
##
@@ -677,6 +685,33 @@ public void setNoScanAnalyzeCommand(boolean 
isNoScanAnalyzeCommand) {
   public boolean hasInsertTables() {
 return this.insertIntoTables.size() > 0 || 
this.insertOverwriteTables.size() > 0;
   }
+
+  public boolean isFullyAggregate() throws SemanticException {

Review comment:
   Although the method is evident, could we add a comment?
   
   Should this be static since it is a utility method that can be used beyond 
this scope?

##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java
##
@@ -457,4 +469,17 @@ public boolean hasTableDefined() {
 return !(aliases.size() == 1 && 
aliases.get(0).equals(SemanticAnalyzer.DUMMY_TABLE));
   }
 
+  public void addSubqExprAlias(ASTNode expressionTree, SemanticAnalyzer 
semanticAnalyzer) throws SemanticException {
+String alias = "__subexpr" + subQueryExpressionAliasCounter++;
+
+// Recursively do the first phase of semantic analysis for the subquery
+QBExpr qbexpr = new QBExpr(alias);
+
+ASTNode subqref = (ASTNode) expressionTree.getChild(1);
+semanticAnalyzer.doPhase1QBExpr(subqref, qbexpr, getId(), alias, 
isInsideView());

Review comment:
   Trying to understand this step. Does this lead to parsing the same 
subquery multiple times?

##
File path: ql/src/test/queries/clientpositive/cte_mat_6.q
##
@@ -0,0 +1,81 @@
+set hive.optimize.cte.materialize.threshold=1;
+
+create table t0(col0 int);
+
+insert into t0(col0) values
+(1),(2),
+(100),(100),(100),
+(200),(200);
+
+-- CTE is referenced from scalar subquery in the select clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+-- disable cte materialization
+set hive.optimize.cte.materialize.threshold=-1;
+
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, (select small_count from cte)
+from t0
+order by t0.col0;
+
+
+-- enable cte materialization
+set hive.optimize.cte.materialize.threshold=1;
+
+-- CTE is referenced from scalar subquery in the where clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0
+from t0
+where t0.col0 > (select small_count from cte)
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0
+from t0
+where t0.col0 > (select small_count from cte)
+order by t0.col0;
+
+-- CTE is referenced from scalar subquery in the having clause
+explain
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, count(*)
+from t0
+group by col0
+having count(*) > (select small_count from cte)
+order by t0.col0;
+
+with cte as (select count(*) as small_count from t0 where col0 < 10)
+select t0.col0, count(*)
+from t0
+group by col0
+having count(*) > (select small_count from cte)
+order by t0.col0;
+
+-- mix full aggregate and non-full aggregate ctes
+explain
+with cte1 as (select col0 as k1 from t0 where col0 = '5'),
+ cte2 as (select count(*) as all_count from t0),
+ cte3 as (select col0 as k3, col0 + col0 as k3_2x, count(*) as key_count 
from t0 group by col0)
+select t0.col0, count(*)
+from t0
+join cte1 on t0.col0 = cte1.k1
+join cte3 on t0.col0 = cte3.k3
+group by col0
+having count(*) > (select all_count from cte2)

Review comment:
Could we add tests to make sure the optimization is only triggered for SELECT queries? For instance, I am thinking about CTAS and CMV statements; the optimization should not be triggered in those cases (I guess it could lead to some kind of side effect, at least for CMV).
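A minimal sketch of such tests, reusing the t0/cte setup from this file; the target names are illustrative and the exact statements would need to respect CTAS/CMV restrictions:

{code:sql}
-- CTE materialization should not be triggered for CTAS
explain
create table ctas_target as
with cte as (select count(*) as small_count from t0 where col0 < 10)
select t0.col0, (select small_count from cte) from t0;

-- CTE materialization should not be triggered for CMV
explain
create materialized view cmv_target as
with cte as (select count(*) as small_count from t0 where col0 < 10)
select t0.col0 from t0 where t0.col0 > (select small_count from cte);
{code}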





[jira] [Work logged] (HIVE-24084) Enhance cost model to push down more Aggregates

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24084?focusedWorklogId=475953&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475953
 ]

ASF GitHub Bot logged work on HIVE-24084:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 19:14
Start Date: 28/Aug/20 19:14
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1439:
URL: https://github.com/apache/hive/pull/1439#discussion_r479474959



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);

Review comment:
   We could rely on `mq.areColumnsUnique`.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveOnTezCostModel.java
##
@@ -89,22 +89,23 @@ public RelOptCost getAggregateCost(HiveAggregate aggregate) 
{
 } else {
   final RelMetadataQuery mq = aggregate.getCluster().getMetadataQuery();
   // 1. Sum of input cardinalities
-  final Double rCount = mq.getRowCount(aggregate.getInput());
-  if (rCount == null) {
+  final Double inputRowCount = mq.getRowCount(aggregate.getInput());
+  final Double rowCount = mq.getRowCount(aggregate);
+  if (inputRowCount == null || rowCount == null) {
 return null;
   }
   // 2. CPU cost = sorting cost
-  final double cpuCost = algoUtils.computeSortCPUCost(rCount);
+  final double cpuCost = algoUtils.computeSortCPUCost(rowCount) + 
inputRowCount * algoUtils.getCpuUnitCost();

Review comment:
   Not sure about this change. If the algorithm is sort-based, you will 
still sort the complete input, right?
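For reference, the two CPU cost formulas under discussion, restated from the diff above:

{code:java}
// Before the PR: sort cost over the full input cardinality
final double cpuBefore = algoUtils.computeSortCPUCost(inputRowCount);
// After the PR: sort cost over the output cardinality, plus a linear
// per-row cost for scanning the input
final double cpuAfter = algoUtils.computeSortCPUCost(rowCount)
    + inputRowCount * algoUtils.getCpuUnitCost();
{code}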

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -303,6 +305,90 @@ public void onMatch(RelOptRuleCall call) {
 }
   }
 
+  /**
+   * Determines whether the given grouping is unique.
+   *
+   * Consider a join which might produce non-unique rows, but later the results are aggregated again.
+   * This method determines if there are sufficient columns in the grouping which were previously present as unique column(s).
+   */
+  private boolean isGroupingUnique(RelNode input, ImmutableBitSet groups) {
+if (groups.isEmpty()) {
+  return false;
+}
+RelMetadataQuery mq = input.getCluster().getMetadataQuery();
+Set<ImmutableBitSet> uKeys = mq.getUniqueKeys(input);
+for (ImmutableBitSet u : uKeys) {
+  if (groups.contains(u)) {
+return true;
+  }
+}
+if (input instanceof Join) {
+  Join join = (Join) input;
+  RexBuilder rexBuilder = input.getCluster().getRexBuilder();
+  SimpleConditionInfo cond = new SimpleConditionInfo(join.getCondition(), 
rexBuilder);
+
+  if (cond.valid) {
+ImmutableBitSet newGroup = 
groups.intersect(ImmutableBitSet.fromBitSet(cond.fields));
+RelNode l = join.getLeft();
+RelNode r = join.getRight();
+
+int joinFieldCount = join.getRowType().getFieldCount();
+int lFieldCount = l.getRowType().getFieldCount();
+
+ImmutableBitSet groupL = newGroup.get(0, lFieldCount);
+ImmutableBitSet groupR = newGroup.get(lFieldCount, 
joinFieldCount).shift(-lFieldCount);
+
+if (isGroupingUnique(l, groupL)) {

Review comment:
   This could call `mq.areColumnsUnique` instead of making the recursive 
call.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateJoinTransposeRule.java
##
@@ -290,7 +291,8 @@ public void onMatch(RelOptRuleCall call) {
   RelNode r = relBuilder.build();
   RelOptCost afterCost = mq.getCumulativeCost(r);
   RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
-  if (afterCost.isLt(beforeCost)) {

Review comment:
   I think you suggested changing this... Maybe `isLe` if we do not 
introduce an additional aggregate on top?
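A sketch of what that could look like; `addsAggregateOnTop` is a hypothetical flag, not from the PR:

{code:java}
RelOptCost afterCost = mq.getCumulativeCost(r);
RelOptCost beforeCost = mq.getCumulativeCost(aggregate);
// Hypothetical: accept cost ties only when the rewrite does not add an
// extra Aggregate on top of the join
if (addsAggregateOnTop ? afterCost.isLt(beforeCost) : afterCost.isLe(beforeCost)) {
  call.transformTo(r);
}
{code}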


[jira] [Work logged] (HIVE-22782) Consolidate metastore call to fetch constraints

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22782?focusedWorklogId=475938&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475938
 ]

ASF GitHub Bot logged work on HIVE-22782:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 18:27
Start Date: 28/Aug/20 18:27
Worklog Time Spent: 10m 
  Work Description: sankarh commented on a change in pull request #1419:
URL: https://github.com/apache/hive/pull/1419#discussion_r479451533



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
##
@@ -2811,6 +2811,26 @@ public GetFieldsResponse 
getFieldsRequest(GetFieldsRequest req)
 return client.get_check_constraints(req).getCheckConstraints();
   }
 
+  @Override
+  public SQLAllTableConstraints 
getAllTableConstraints(AllTableConstraintsRequest req)
+  throws MetaException, NoSuchObjectException, TException {
+long t1 = System.currentTimeMillis();
+
+try {
+  if (!req.isSetCatName()) {
+req.setCatName(getDefaultCatalog(conf));
+  }
+
+  return client.get_all_table_constraints(req).getAllTableConstraints();
+} finally {
+  long diff = System.currentTimeMillis() - t1;

Review comment:
I think it is redundant, as HMS also logs the time taken by the get_all_table_constraints API.

##
File path: 
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/RawStore.java
##
@@ -1485,6 +1485,16 @@ void getFileMetadataByExpr(List<Long> fileIds, FileMetadataExprType type, byte[]
   List<SQLCheckConstraint> getCheckConstraints(String catName, String db_name,
String tbl_name) throws MetaException;
 
+  /**
+   *  Get all constraints of the table
+   * @param catName catalog name
+   * @param db_name database name
+   * @param tbl_name table name
+   * @return all constraints for this table
+   * @throws MetaException error accessing the RDBMS
+   */
+  SQLAllTableConstraints getAllTableConstraints(String catName, String 
db_name, String tbl_name)

Review comment:
   Follow uniform naming style for arguments.
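For instance, a sketch of the signature with uniformly camel-cased arguments:

{code:java}
SQLAllTableConstraints getAllTableConstraints(String catName, String dbName, String tblName)
    throws MetaException;
{code}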

##
File path: 
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
##
@@ -1014,6 +1015,11 @@ public FileMetadataHandler 
getFileMetadataHandler(FileMetadataExprType type) {
 return null;
   }
 
+  @Override public SQLAllTableConstraints getAllTableConstraints(String 
catName, String db_name, String tbl_name)

Review comment:
Nit: Annotation and API signature can be on separate lines.

##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java
##
@@ -2811,6 +2811,26 @@ public GetFieldsResponse 
getFieldsRequest(GetFieldsRequest req)
 return client.get_check_constraints(req).getCheckConstraints();
   }
 
+  @Override
+  public SQLAllTableConstraints 
getAllTableConstraints(AllTableConstraintsRequest req)
+  throws MetaException, NoSuchObjectException, TException {
+long t1 = System.currentTimeMillis();
+
+try {
+  if (!req.isSetCatName()) {

Review comment:
The HMS API already handles the default catalog name flow, so we can remove it here.

##
File path: 
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestGetAllTableConstraints.java
##
@@ -0,0 +1,382 @@
+package org.apache.hadoop.hive.metastore.client;
+
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.MetaStoreTestUtils;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreCheckinTest;
+import org.apache.hadoop.hive.metastore.api.AllTableConstraintsRequest;
+import org.apache.hadoop.hive.metastore.api.Catalog;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.PrimaryKeysRequest;
+import org.apache.hadoop.hive.metastore.api.SQLAllTableConstraints;
+import org.apache.hadoop.hive.metastore.api.SQLCheckConstraint;
+import org.apache.hadoop.hive.metastore.api.SQLDefaultConstraint;
+import org.apache.hadoop.hive.metastore.api.SQLForeignKey;
+import org.apache.hadoop.hive.metastore.api.SQLNotNullConstraint;
+import org.apache.hadoop.hive.metastore.api.SQLPrimaryKey;
+import org.apache.hadoop.hive.metastore.api.SQLUniqueConstraint;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.hadoop.hive.metastore.client.builder.CatalogBuilder;
+import org.apache.hadoop.hive.metastore.client.builder.DatabaseBuilder;
+import 
org.apache.hadoop.hive.metastore.client.builder.SQLCheckConstraintBuilder;
+import 
org.apache.hadoop.hive.metastore.client.builder.SQLDefaultConstraintBuilder;
+import 

[jira] [Commented] (HIVE-22622) Hive allows to create a struct with duplicate attribute names

2020-08-28 Thread Jesus Camacho Rodriguez (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186750#comment-17186750
 ] 

Jesus Camacho Rodriguez commented on HIVE-22622:


I think [~zabetak] has faced a similar issue. I think field uniqueness should indeed be checked; otherwise I am not sure how things work.

> Hive allows to create a struct with duplicate attribute names
> -
>
> Key: HIVE-22622
> URL: https://issues.apache.org/jira/browse/HIVE-22622
> Project: Hive
>  Issue Type: Bug
>Reporter: Denys Kuzmenko
>Assignee: Krisztian Kasa
>Priority: Major
>
> When you create a table with a struct that has the same attribute name twice, 
> Hive allows you to create it.
> create table test_struct( duplicateColumn struct<id:int,id:int>);
> You can insert data into it:
> insert into test_struct select named_struct("id",1,"id",1);
> But you cannot read it:
> select * from test_struct;
> Return: java.io.IOException: java.io.IOException: Error reading file: 
> hdfs://.../test_struct/delta_001_001_/bucket_0 ,
> We can create and insert, but fail on reading the struct part of the table. We 
> can still read all other columns (if we have more than one) but not the 
> struct anymore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24065) Bloom filters can be cached after deserialization in VectorInBloomFilterColDynamicValue

2020-08-28 Thread Ashutosh Chauhan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186736#comment-17186736
 ] 

Ashutosh Chauhan commented on HIVE-24065:
-

+1

> Bloom filters can be cached after deserialization in 
> VectorInBloomFilterColDynamicValue
> ---
>
> Key: HIVE-24065
> URL: https://issues.apache.org/jira/browse/HIVE-24065
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-08-05-10-05-25-080.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Same bloom filter is loaded multiple times across tasks. It would be good to 
> check if we can optimise this, to avoid deserializing it repeatedly.
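A minimal sketch of one possible approach, caching the deserialized filter keyed by its serialized bytes; the class and cache are hypothetical, not the committed change:

{code:java}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import org.apache.hive.common.util.BloomKFilter;

public final class BloomFilterCache {
  // ByteBuffer.wrap gives content-based equals/hashCode, so identical
  // serialized filters map to the same cache entry.
  private static final ConcurrentMap<ByteBuffer, BloomKFilter> CACHE = new ConcurrentHashMap<>();

  public static BloomKFilter getOrDeserialize(byte[] serialized) throws IOException {
    ByteBuffer key = ByteBuffer.wrap(serialized);
    BloomKFilter cached = CACHE.get(key);
    if (cached == null) {
      cached = BloomKFilter.deserialize(new ByteArrayInputStream(serialized));
      CACHE.putIfAbsent(key, cached);
    }
    return cached;
  }
}
{code}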



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24087) FK side join elimination in presence of PK-FK constraint

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24087?focusedWorklogId=475925&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475925
 ]

ASF GitHub Bot logged work on HIVE-24087:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 17:49
Start Date: 28/Aug/20 17:49
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1440:
URL: https://github.com/apache/hive/pull/1440#discussion_r479451432



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java
##
@@ -75,6 +75,7 @@
 import org.apache.hadoop.hive.ql.optimizer.calcite.translator.TypeConverter;
 import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
 import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
+import org.apache.parquet.Preconditions;

Review comment:
Nit: use Guava Preconditions instead of Parquet's.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475925)
Time Spent: 1h  (was: 50m)

> FK side join elimination in presence of PK-FK constraint
> 
>
> Key: HIVE-24087
> URL: https://issues.apache.org/jira/browse/HIVE-24087
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> If there is a PK-FK join, the join could be eliminated by removing the FK side if 
> the following conditions are met:
> * There is no row filtering on the FK side.
> * No columns from the FK side are required after the JOIN.
> * FK join columns are guaranteed to be unique (have a group by).
> * FK join columns are guaranteed to be NOT NULL (either an IS NOT NULL filter or 
> a constraint).
> *Example*
> {code:sql}
> EXPLAIN 
> SELECT customer_removal_n0.*
> FROM customer_removal_n0
> JOIN
> (SELECT lo_custkey
> FROM lineorder_removal_n0
> WHERE lo_custkey IS NOT NULL
> GROUP BY lo_custkey) fkSide ON fkSide.lo_custkey = 
> customer_removal_n0.c_custkey;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23938) LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used anymore

2020-08-28 Thread Ashutosh Chauhan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186735#comment-17186735
 ] 

Ashutosh Chauhan commented on HIVE-23938:
-

[~abstractdog] do these new args work on both JDK 8 and JDK 11, or do we need separate 
args for 8 vs 11?

> LLAP: JDK11 - some GC log file rotation related jvm arguments cannot be used 
> anymore
> 
>
> Key: HIVE-23938
> URL: https://issues.apache.org/jira/browse/HIVE-23938
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: gc_2020-07-27-13.log, gc_2020-07-29-12.jdk8.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/hive/blob/master/llap-server/bin/runLlapDaemon.sh#L55
> {code}
> JAVA_OPTS_BASE="-server -Djava.net.preferIPv4Stack=true -XX:+UseNUMA 
> -XX:+PrintGCDetails -verbose:gc -XX:+UseGCLogFileRotation 
> -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M -XX:+PrintGCDateStamps"
> {code}
> on JDK11 I got something like:
> {code}
> + exec /usr/lib/jvm/jre-11-openjdk/bin/java -Dproc_llapdaemon -Xms32000m 
> -Xmx64000m -Dhttp.maxConnections=17 -XX:+UseG1GC -XX:+ResizeTLAB -XX:+UseNUMA 
> -XX:+AggressiveOpts -XX:MetaspaceSize=1024m 
> -XX:InitiatingHeapOccupancyPercent=80 -XX:MaxGCPauseMillis=200 
> -XX:+PreserveFramePointer -XX:AllocatePrefetchStyle=2 
> -Dhttp.maxConnections=10 -Dasync.profiler.home=/grid/0/async-profiler -server 
> -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+PrintGCDetails -verbose:gc 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=4 -XX:GCLogFileSize=100M 
> -XX:+PrintGCDateStamps 
> -Xloggc:/grid/2/yarn/container-logs/application_1595375468459_0113/container_e26_1595375468459_0113_01_09/gc_2020-07-27-12.log
>  
> ... 
> org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon
> OpenJDK 64-Bit Server VM warning: Option AggressiveOpts was deprecated in 
> version 11.0 and will likely be removed in a future release.
> Unrecognized VM option 'UseGCLogFileRotation'
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
> {code}
> These are not valid in JDK11:
> {code}
> -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles
> -XX:GCLogFileSize
> -XX:+PrintGCTimeStamps
> -XX:+PrintGCDateStamps
> {code}
> Instead something like:
> {code}
> -Xlog:gc*,safepoint:gc.log:time,uptime:filecount=4,filesize=100M
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24087) FK side join elimination in presence of PK-FK constraint

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24087?focusedWorklogId=475907&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475907
 ]

ASF GitHub Bot logged work on HIVE-24087:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 17:06
Start Date: 28/Aug/20 17:06
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on pull request #1440:
URL: https://github.com/apache/hive/pull/1440#issuecomment-682920806


   @jcamachor Thanks for the suggestions. I will add all three tests (and yes 
all of these cases are expected to trigger the rewrite)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475907)
Time Spent: 50m  (was: 40m)

> FK side join elimination in presence of PK-FK constraint
> 
>
> Key: HIVE-24087
> URL: https://issues.apache.org/jira/browse/HIVE-24087
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> If there is a PK-FK join, the join could be eliminated by removing the FK side if 
> the following conditions are met:
> * There is no row filtering on the FK side.
> * No columns from the FK side are required after the JOIN.
> * FK join columns are guaranteed to be unique (have a group by).
> * FK join columns are guaranteed to be NOT NULL (either an IS NOT NULL filter or 
> a constraint).
> *Example*
> {code:sql}
> EXPLAIN 
> SELECT customer_removal_n0.*
> FROM customer_removal_n0
> JOIN
> (SELECT lo_custkey
> FROM lineorder_removal_n0
> WHERE lo_custkey IS NOT NULL
> GROUP BY lo_custkey) fkSide ON fkSide.lo_custkey = 
> customer_removal_n0.c_custkey;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24087) FK side join elimination in presence of PK-FK constraint

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24087?focusedWorklogId=475906&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475906
 ]

ASF GitHub Bot logged work on HIVE-24087:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 17:05
Start Date: 28/Aug/20 17:05
Worklog Time Spent: 10m 
  Work Description: vineetgarg02 commented on a change in pull request 
#1440:
URL: https://github.com/apache/hive/pull/1440#discussion_r479430314



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java
##
@@ -213,61 +218,137 @@ public void onMatch(RelOptRuleCall call) {
 
 // 2) Check whether this join can be rewritten or removed
 RewritablePKFKJoinInfo r = HiveRelOptUtil.isRewritablePKFKJoin(
-join, leftInput == fkInput, call.getMetadataQuery());
+join, fkInput,  nonFkInput, call.getMetadataQuery());
 
 // 3) If it is the only condition, we can trigger the rewriting
 if (r.rewritable) {
-  List<RexNode> nullableNodes = r.nullableNodes;
-  // If we reach here, we trigger the transform
-  if (mode == Mode.REMOVE) {
-if (rightInputPotentialFK) {
-  // First, if FK is the right input, we need to shift
-  nullableNodes = nullableNodes.stream()
-  .map(node -> RexUtil.shift(node, 0, 
-leftInput.getRowType().getFieldCount()))
-  .collect(Collectors.toList());
-  topProjExprs = topProjExprs.stream()
-  .map(node -> RexUtil.shift(node, 0, 
-leftInput.getRowType().getFieldCount()))
-  .collect(Collectors.toList());
-}
-// Fix nullability in references to the input node
-topProjExprs = HiveCalciteUtil.fixNullability(rexBuilder, 
topProjExprs, RelOptUtil.getFieldTypeList(fkInput.getRowType()));
-// Trigger transformation
-if (nullableNodes.isEmpty()) {
-  call.transformTo(call.builder()
-  .push(fkInput)
-  .project(topProjExprs)
-  .convert(project.getRowType(), false)
-  .build());
+  rewrite(mode, fkInput, nonFkInput, join, topProjExprs, call, project, 
r.nullableNodes);
+} else {
+  // check if FK side could be removed instead
+
+  // Possibly this could be enhanced to take other join type into 
consideration.
+  if (joinType != JoinRelType.INNER) {
+return;
+  }
+
+  //first swap fk and non-fk input and see if we can rewrite them
+  RewritablePKFKJoinInfo fkRemoval = HiveRelOptUtil.isRewritablePKFKJoin(
+  join, nonFkInput, fkInput, call.getMetadataQuery());
+
+  if (fkRemoval.rewritable) {
+// we have established that nonFkInput is FK, and fkInput is PK
+// and there is no row filtering on FK side
+
+// check that FK side join column is distinct (i.e. have a group by)
+ImmutableBitSet fkSideBitSet;
+if (nonFkInput == leftInput) {
+  fkSideBitSet = leftBits;
 } else {
-  RexNode newFilterCond;
-  if (nullableNodes.size() == 1) {
-newFilterCond = 
rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, nullableNodes.get(0));
-  } else {
-List<RexNode> isNotNullConds = new ArrayList<>();
-for (RexNode nullableNode : nullableNodes) {
-  
isNotNullConds.add(rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, 
nullableNode));
+  fkSideBitSet = rightBits;
+}
+
+ImmutableBitSet.Builder fkJoinColBuilder = ImmutableBitSet.builder();
+for (RexNode conj : RelOptUtil.conjunctions(cond)) {
+  if (!conj.isA(SqlKind.EQUALS)) {
+continue;

Review comment:
@kgyrtkirk If there is any other kind of predicate/condition, `isRewritablePKFKJoin` will return false. But you are right that the code here should return instead of continue. I will update the code. Thanks for pointing it out.
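A sketch of the agreed fix, in the shape of the loop from the diff above:

{code:java}
for (RexNode conj : RelOptUtil.conjunctions(cond)) {
  if (!conj.isA(SqlKind.EQUALS)) {
    // a non-equi conjunct means the FK side cannot safely be removed; bail out
    return;
  }
  // ... collect the equi-join columns into fkJoinColBuilder ...
}
{code}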





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475906)
Time Spent: 40m  (was: 0.5h)

> FK side join elimination in presence of PK-FK constraint
> 
>
> Key: HIVE-24087
> URL: https://issues.apache.org/jira/browse/HIVE-24087
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> If there 

[jira] [Work logged] (HIVE-24087) FK side join elimination in presence of PK-FK constraint

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24087?focusedWorklogId=475870&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475870
 ]

ASF GitHub Bot logged work on HIVE-24087:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 15:39
Start Date: 28/Aug/20 15:39
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on pull request #1440:
URL: https://github.com/apache/hive/pull/1440#issuecomment-682729013


   Apart from the tests mentioned above, we should add a test where the 
aggregate contains a SUM to see whether that works correctly too (that's the 
pattern seen in the original query).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475870)
Time Spent: 0.5h  (was: 20m)

> FK side join elimination in presence of PK-FK constraint
> 
>
> Key: HIVE-24087
> URL: https://issues.apache.org/jira/browse/HIVE-24087
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If there is a PK-FK join, the join could be eliminated by removing the FK side if 
> the following conditions are met:
> * There is no row filtering on the FK side.
> * No columns from the FK side are required after the JOIN.
> * FK join columns are guaranteed to be unique (have a group by).
> * FK join columns are guaranteed to be NOT NULL (either an IS NOT NULL filter or 
> a constraint).
> *Example*
> {code:sql}
> EXPLAIN 
> SELECT customer_removal_n0.*
> FROM customer_removal_n0
> JOIN
> (SELECT lo_custkey
> FROM lineorder_removal_n0
> WHERE lo_custkey IS NOT NULL
> GROUP BY lo_custkey) fkSide ON fkSide.lo_custkey = 
> customer_removal_n0.c_custkey;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24074) Incorrect handling of timestamp in Parquet/Avro when written in certain time zones in versions before Hive 3.x

2020-08-28 Thread Jesus Camacho Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-24074:
---
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to master.

> Incorrect handling of timestamp in Parquet/Avro when written in certain time 
> zones in versions before Hive 3.x
> --
>
> Key: HIVE-24074
> URL: https://issues.apache.org/jira/browse/HIVE-24074
> Project: Hive
>  Issue Type: Bug
>  Components: Avro, Parquet
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The timezone conversion for Parquet and Avro uses new {{java.time.\*}} 
> classes, which can lead to incorrect values returned for certain dates in 
> certain timezones if timestamp was computed and converted based on 
> {{java.sql.\*}} classes. For instance, the offset used for Singapore timezone 
> in 1900-01-01T00:00:00.000 is UTC+8, while the correct offset for that date 
> should be UTC+6:55:25. Some additional information can be found here: 
> https://stackoverflow.com/a/52152315
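For reference, a self-contained snippet showing the historical offset that java.time reports for that instant (the java.sql-based path described above resolves it to UTC+8 instead):

{code:java}
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class SingaporeOffsetDemo {
  public static void main(String[] args) {
    // IANA tzdata: Singapore used local mean time, UTC+6:55:25, at this date
    LocalDateTime ldt = LocalDateTime.of(1900, 1, 1, 0, 0);
    ZoneOffset offset = ZoneId.of("Asia/Singapore").getRules().getOffset(ldt);
    System.out.println(offset); // prints +06:55:25
  }
}
{code}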



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24074) Incorrect handling of timestamp in Parquet/Avro when written in certain time zones in versions before Hive 3.x

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24074:
--
Labels: pull-request-available  (was: )

> Incorrect handling of timestamp in Parquet/Avro when written in certain time 
> zones in versions before Hive 3.x
> --
>
> Key: HIVE-24074
> URL: https://issues.apache.org/jira/browse/HIVE-24074
> Project: Hive
>  Issue Type: Bug
>  Components: Avro, Parquet
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The timezone conversion for Parquet and Avro uses new {{java.time.\*}} 
> classes, which can lead to incorrect values returned for certain dates in 
> certain timezones if timestamp was computed and converted based on 
> {{java.sql.\*}} classes. For instance, the offset used for Singapore timezone 
> in 1900-01-01T00:00:00.000 is UTC+8, while the correct offset for that date 
> should be UTC+6:55:25. Some additional information can be found here: 
> https://stackoverflow.com/a/52152315



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24074) Incorrect handling of timestamp in Parquet/Avro when written in certain time zones in versions before Hive 3.x

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24074?focusedWorklogId=475857&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475857
 ]

ASF GitHub Bot logged work on HIVE-24074:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 15:17
Start Date: 28/Aug/20 15:17
Worklog Time Spent: 10m 
  Work Description: jcamachor merged pull request #1392:
URL: https://github.com/apache/hive/pull/1392


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475857)
Remaining Estimate: 0h
Time Spent: 10m

> Incorrect handling of timestamp in Parquet/Avro when written in certain time 
> zones in versions before Hive 3.x
> --
>
> Key: HIVE-24074
> URL: https://issues.apache.org/jira/browse/HIVE-24074
> Project: Hive
>  Issue Type: Bug
>  Components: Avro, Parquet
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The timezone conversion for Parquet and Avro uses new {{java.time.\*}} 
> classes, which can lead to incorrect values returned for certain dates in 
> certain timezones if timestamp was computed and converted based on 
> {{java.sql.\*}} classes. For instance, the offset used for Singapore timezone 
> in 1900-01-01T00:00:00.000 is UTC+8, while the correct offset for that date 
> should be UTC+6:55:25. Some additional information can be found here: 
> https://stackoverflow.com/a/52152315



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24087) FK side join elimination in presence of PK-FK constraint

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24087?focusedWorklogId=475841&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475841
 ]

ASF GitHub Bot logged work on HIVE-24087:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 14:32
Start Date: 28/Aug/20 14:32
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1440:
URL: https://github.com/apache/hive/pull/1440#discussion_r479342835



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinConstraintsRule.java
##
@@ -213,61 +218,137 @@ public void onMatch(RelOptRuleCall call) {
 
 // 2) Check whether this join can be rewritten or removed
 RewritablePKFKJoinInfo r = HiveRelOptUtil.isRewritablePKFKJoin(
-join, leftInput == fkInput, call.getMetadataQuery());
+join, fkInput,  nonFkInput, call.getMetadataQuery());
 
 // 3) If it is the only condition, we can trigger the rewriting
 if (r.rewritable) {
-  List<RexNode> nullableNodes = r.nullableNodes;
-  // If we reach here, we trigger the transform
-  if (mode == Mode.REMOVE) {
-if (rightInputPotentialFK) {
-  // First, if FK is the right input, we need to shift
-  nullableNodes = nullableNodes.stream()
-  .map(node -> RexUtil.shift(node, 0, 
-leftInput.getRowType().getFieldCount()))
-  .collect(Collectors.toList());
-  topProjExprs = topProjExprs.stream()
-  .map(node -> RexUtil.shift(node, 0, 
-leftInput.getRowType().getFieldCount()))
-  .collect(Collectors.toList());
-}
-// Fix nullability in references to the input node
-topProjExprs = HiveCalciteUtil.fixNullability(rexBuilder, 
topProjExprs, RelOptUtil.getFieldTypeList(fkInput.getRowType()));
-// Trigger transformation
-if (nullableNodes.isEmpty()) {
-  call.transformTo(call.builder()
-  .push(fkInput)
-  .project(topProjExprs)
-  .convert(project.getRowType(), false)
-  .build());
+  rewrite(mode, fkInput, nonFkInput, join, topProjExprs, call, project, 
r.nullableNodes);
+} else {
+  // check if FK side could be removed instead
+
+  // Possibly this could be enhanced to take other join type into 
consideration.
+  if (joinType != JoinRelType.INNER) {
+return;
+  }
+
+  //first swap fk and non-fk input and see if we can rewrite them
+  RewritablePKFKJoinInfo fkRemoval = HiveRelOptUtil.isRewritablePKFKJoin(
+  join, nonFkInput, fkInput, call.getMetadataQuery());
+
+  if (fkRemoval.rewritable) {
+// we have established that nonFkInput is FK, and fkInput is PK
+// and there is no row filtering on FK side
+
+// check that FK side join column is distinct (i.e. have a group by)
+ImmutableBitSet fkSideBitSet;
+if (nonFkInput == leftInput) {
+  fkSideBitSet = leftBits;
 } else {
-  RexNode newFilterCond;
-  if (nullableNodes.size() == 1) {
-newFilterCond = 
rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, nullableNodes.get(0));
-  } else {
-List<RexNode> isNotNullConds = new ArrayList<>();
-for (RexNode nullableNode : nullableNodes) {
-  
isNotNullConds.add(rexBuilder.makeCall(SqlStdOperatorTable.IS_NOT_NULL, 
nullableNode));
+  fkSideBitSet = rightBits;
+}
+
+ImmutableBitSet.Builder fkJoinColBuilder = ImmutableBitSet.builder();
+for (RexNode conj : RelOptUtil.conjunctions(cond)) {
+  if (!conj.isA(SqlKind.EQUALS)) {
+continue;

Review comment:
Why do we skip all other kinds which are not `EQUALS`? I think there should be a return here instead of a continue.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475841)
Time Spent: 20m  (was: 10m)

> FK side join elimination in presence of PK-FK constraint
> 
>
> Key: HIVE-24087
> URL: https://issues.apache.org/jira/browse/HIVE-24087
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Reporter: Vineet Garg
>Assignee: Vineet Garg
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If there is a PK-FK join, the join could be eliminated by removing the FK side if 
> the following conditions are met:
> * There 

[jira] [Work logged] (HIVE-24089) Run QB compaction as table directory user with impersonation

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24089?focusedWorklogId=475824&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475824
 ]

ASF GitHub Bot logged work on HIVE-24089:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 13:17
Start Date: 28/Aug/20 13:17
Worklog Time Spent: 10m 
  Work Description: klcopp opened a new pull request #1441:
URL: https://github.com/apache/hive/pull/1441


   Currently QB compaction runs as the session user, unlike MR compaction which 
runs as the table/partition directory owner (see 
CompactorThread#findUserToRunAs).
   
   We should make QB compaction run as the table/partition directory owner and 
enable user impersonation during compaction to avoid any issues with temp 
directories.
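A minimal sketch of the impersonation part, assuming the user is resolved along the lines of CompactorThread#findUserToRunAs; runCompactionQueries is a placeholder, and proxy-user configuration must permit the impersonation:

{code:java}
String userToRunAs = findUserToRunAs(tableDir, table);  // as in CompactorThread
UserGroupInformation proxyUser = UserGroupInformation.createProxyUser(
    userToRunAs, UserGroupInformation.getLoginUser());
proxyUser.doAs((PrivilegedExceptionAction<Void>) () -> {
  runCompactionQueries();  // placeholder: issue the query-based compaction statements
  return null;
});
{code}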



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475824)
Remaining Estimate: 0h
Time Spent: 10m

> Run QB compaction as table directory user with impersonation
> 
>
> Key: HIVE-24089
> URL: https://issues.apache.org/jira/browse/HIVE-24089
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently QB compaction runs as the session user, unlike MR compaction which 
> runs as the table/partition directory owner (see 
> CompactorThread#findUserToRunAs).
> We should make QB compaction run as the table/partition directory owner and 
> enable user impersonation during compaction to avoid any issues with temp 
> directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24089) Run QB compaction as table directory user with impersonation

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-24089:
--
Labels: pull-request-available  (was: )

> Run QB compaction as table directory user with impersonation
> 
>
> Key: HIVE-24089
> URL: https://issues.apache.org/jira/browse/HIVE-24089
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently QB compaction runs as the session user, unlike MR compaction which 
> runs as the table/partition directory owner (see 
> CompactorThread#findUserToRunAs).
> We should make QB compaction run as the table/partition directory owner and 
> enable user impersonation during compaction to avoid any issues with temp 
> directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24089) Run QB compaction as table directory user with impersonation

2020-08-28 Thread Karen Coppage (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karen Coppage reassigned HIVE-24089:



> Run QB compaction as table directory user with impersonation
> 
>
> Key: HIVE-24089
> URL: https://issues.apache.org/jira/browse/HIVE-24089
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>
> Currently QB compaction runs as the session user, unlike MR compaction which 
> runs as the table/partition directory owner (see 
> CompactorThread#findUserToRunAs).
> We should make QB compaction run as the table/partition directory owner and 
> enable user impersonation during compaction to avoid any issues with temp 
> directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?focusedWorklogId=475813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475813
 ]

ASF GitHub Bot logged work on HIVE-18284:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 12:18
Start Date: 28/Aug/20 12:18
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #1400:
URL: https://github.com/apache/hive/pull/1400#discussion_r479216917



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplicationUtils.java
##
@@ -181,6 +183,23 @@ public static boolean merge(HiveConf hiveConf, 
ReduceSinkOperator cRS, ReduceSin
TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(new ArrayList<FieldSchema>(), pRS
 .getConf().getOrder(), pRS.getConf().getNullOrder());
 pRS.getConf().setKeySerializeInfo(keyTable);
+  } else if (cRS.getConf().getKeyCols() != null && 
cRS.getConf().getKeyCols().size() > 0) {

Review comment:
   don't we need any conditional on `pRS` here?

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplicationUtils.java
##
@@ -181,6 +183,23 @@ public static boolean merge(HiveConf hiveConf, 
ReduceSinkOperator cRS, ReduceSin
TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(new ArrayList<FieldSchema>(), pRS
 .getConf().getOrder(), pRS.getConf().getNullOrder());
 pRS.getConf().setKeySerializeInfo(keyTable);
+  } else if (cRS.getConf().getKeyCols() != null && 
cRS.getConf().getKeyCols().size() > 0) {
+ArrayList<String> keyColNames = Lists.newArrayList();
+for (ExprNodeDesc keyCol : pRS.getConf().getKeyCols()) {
+  String keyColName = keyCol.getExprString();
+  keyColNames.add(keyColName);
+}
+List<FieldSchema> fields = 
PlanUtils.getFieldSchemasFromColumnList(pRS.getConf().getKeyCols(),
+keyColNames, 0, "");
+TableDesc keyTable = PlanUtils.getReduceKeyTableDesc(fields, 
pRS.getConf().getOrder(),
+pRS.getConf().getNullOrder());
+ArrayList<String> outputKeyCols = Lists.newArrayList();
+for (int i = 0; i < fields.size(); i++) {
+  outputKeyCols.add(fields.get(i).getName());
+}
+pRS.getConf().setOutputKeyColumnNames(outputKeyCols);
+pRS.getConf().setKeySerializeInfo(keyTable);
+
pRS.getConf().setNumDistributionKeys(cRS.getConf().getNumDistributionKeys());
   }

Review comment:
I think we should be merging the child into the parent inside this "if". We have 2 specific conditionals which are handled, so I think an else returning false would be needed here, to close down unhandled future cases.
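A sketch of the suggested structure; the branch conditions are shorthand for the two handled cases:

{code:java}
if (childKeyColsEmpty) {            // shorthand: the existing empty-key branch above
  // ... existing merge with an empty key table ...
} else if (childKeyColsPresent) {   // shorthand: the new branch from this diff
  // ... rebuild the parent's key serialization from its key columns ...
} else {
  // close down unhandled future cases explicitly
  return false;
}
{code}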





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475813)
Time Spent: 40m  (was: 0.5h)

> NPE when inserting data with 'distribute by' clause with dynpart sort 
> optimization
> --
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Aki Tanaka
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> *(non-vectorized , non-llap mode)*
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.sort.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By or 
> use the Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we 
> use Distribute By.
> 

[jira] [Work logged] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-18284?focusedWorklogId=475805&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475805
 ]

ASF GitHub Bot logged work on HIVE-18284:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 11:57
Start Date: 28/Aug/20 11:57
Worklog Time Spent: 10m 
  Work Description: shameersss1 commented on pull request #1400:
URL: https://github.com/apache/hive/pull/1400#issuecomment-682485412


   @jcamachor @kgyrtkirk Ping for review request!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475805)
Time Spent: 0.5h  (was: 20m)

> NPE when inserting data with 'distribute by' clause with dynpart sort 
> optimization
> --
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Aki Tanaka
>Assignee: Syed Shameerur Rahman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A Null Pointer Exception occurs when inserting data with 'distribute by' 
> clause. The following snippet query reproduces this issue:
> *(non-vectorized , non-llap mode)*
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.sort.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey ;
> {code}
> I could run the insert query without the error if I remove Distribute By or 
> use the Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee 
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we 
> use Distribute By.
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged.
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, 
> diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, 
> diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( 
> failure ) : 
> attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row (tag=0) 
> {"key":{},"value":{"_col0":"ROW3","_col1":1}}
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
>   at 
> 

[jira] [Work logged] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-12679?focusedWorklogId=475789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475789
 ]

ASF GitHub Bot logged work on HIVE-12679:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 10:19
Start Date: 28/Aug/20 10:19
Worklog Time Spent: 10m 
  Work Description: moomindani edited a comment on pull request #1402:
URL: https://github.com/apache/hive/pull/1402#issuecomment-682448121


   @sankarh Thank you for the comment, I will try to address it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475789)
Time Spent: 1h 10m  (was: 1h)

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Assignee: Noritaka Sekiyama
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin
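A minimal sketch of the conf-driven loading being proposed, mirroring how a RawStore implementation is resolved; the property name and constructor signature are illustrative only:

{code:java}
// Illustrative property name, not an existing HiveConf variable
String clientImpl = conf.get("hive.metastore.client.class",
    "org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient");
Class<? extends IMetaStoreClient> clazz =
    Class.forName(clientImpl).asSubclass(IMetaStoreClient.class);
// Illustrative constructor; the real change would match however
// SessionHiveMetaStoreClient is constructed today
IMetaStoreClient client = clazz.getConstructor(HiveConf.class).newInstance(conf);
{code}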



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2020-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-12679?focusedWorklogId=475788&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-475788
 ]

ASF GitHub Bot logged work on HIVE-12679:
-

Author: ASF GitHub Bot
Created on: 28/Aug/20 10:18
Start Date: 28/Aug/20 10:18
Worklog Time Spent: 10m 
  Work Description: moomindani commented on pull request #1402:
URL: https://github.com/apache/hive/pull/1402#issuecomment-682448121


   sankarh@ Thank you for the comment, I will try to address it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 475788)
Time Spent: 1h  (was: 50m)

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Assignee: Noritaka Sekiyama
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-22622) Hive allows to create a struct with duplicate attribute names

2020-08-28 Thread Krisztian Kasa (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186410#comment-17186410
 ] 

Krisztian Kasa commented on HIVE-22622:
---

[~jcamachorodriguez]
It seems that this was fixed already because I was not able to repro: 
{code}
create table test_struct( duplicateColumn struct<id:int,id:int>);

insert into test_struct select named_struct("id",1,"id",2);

select * from test_struct;
{"id":1,"id":2}

select duplicateColumn.id from test_struct;
1
{code}

Should field name uniqueness be checked when creating the table anyway, since 
only one of the fields can be queried directly?
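A minimal sketch of such a check at DDL time; placement and exception type are illustrative:

{code:java}
// Reject struct types whose field names collide (case-insensitively)
List<String> fieldNames = structTypeInfo.getAllStructFieldNames();
Set<String> seen = new HashSet<>();
for (String fieldName : fieldNames) {
  if (!seen.add(fieldName.toLowerCase())) {
    throw new SemanticException("Duplicate struct field name: " + fieldName);
  }
}
{code}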

> Hive allows to create a struct with duplicate attribute names
> -
>
> Key: HIVE-22622
> URL: https://issues.apache.org/jira/browse/HIVE-22622
> Project: Hive
>  Issue Type: Bug
>Reporter: Denys Kuzmenko
>Assignee: Krisztian Kasa
>Priority: Major
>
> When you create a table with a struct that has the same attribute name twice, 
> Hive allows you to create it.
> create table test_struct( duplicateColumn struct<id:int,id:int>);
> You can insert data into it:
> insert into test_struct select named_struct("id",1,"id",1);
> But you cannot read it:
> select * from test_struct;
> Return: java.io.IOException: java.io.IOException: Error reading file: 
> hdfs://.../test_struct/delta_001_001_/bucket_0 ,
> We can create and insert, but fail on reading the struct part of the table. We 
> can still read all other columns (if we have more than one) but not the 
> struct anymore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)