[jira] [Created] (HIVE-20335) Add tests for materialized view rewriting with composite aggregation functions

2018-08-07 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-20335:
--

 Summary: Add tests for materialized view rewriting with composite 
aggregation functions
 Key: HIVE-20335
 URL: https://issues.apache.org/jira/browse/HIVE-20335
 Project: Hive
  Issue Type: Test
  Components: Materialized views, Test
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Review Request 68261: HIVE-20332

2018-08-07 Thread Jesús Camacho Rodríguez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/
---

Review request for hive and Ashutosh Chauhan.


Bugs: HIVE-20332
https://issues.apache.org/jira/browse/HIVE-20332


Repository: hive-git


Description
---

HIVE-20332


Diffs
-

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 
e251920d8b1f14b6ca0df7855385702d7a2e2904 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java
 635d27e723dc1d260574723296f3484c26106a9c 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveMaterializedViewsRelMetadataProvider.java
 PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java
 43f8508ffbf4ba3cc46016e1d300d6ca9c2e8ccb 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCumulativeCost.java
 PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java
 80b939a9f65142baa149b79460b753ddf469aacf 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java
 575902d78de2a7f95585c23a3c2fc03b9ce89478 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSize.java
 97097381d9619e67bcab8a268d571d2a392485b3 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java
 3bf62c535cec1e7a3eac43f0ce40879dbfc89799 
  ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java 
361f150193a155d45eb64266f88eb88f0a881ad3 
  ql/src/test/results/clientpositive/llap/materialized_view_partitioned.q.out 
b12df11a98e55c00c8b77e8292666373f3509364 
  ql/src/test/results/clientpositive/llap/materialized_view_rebuild.q.out 
4d37d82b6e1f3d4ab8b76c391fa94176356093c2 


Diff: https://reviews.apache.org/r/68261/diff/1/


Testing
---


Thanks,

Jesús Camacho Rodríguez



[jira] [Created] (HIVE-20334) Don't change casing of columns and partitions or make system case agnostic

2018-08-07 Thread nirav patel (JIRA)
nirav patel created HIVE-20334:
--

 Summary: Don't change casing of columns and partitions or make
system case agnostic
 Key: HIVE-20334
 URL: https://issues.apache.org/jira/browse/HIVE-20334
 Project: Hive
  Issue Type: Bug
  Components: Hive, HiveServer2, Metastore
Affects Versions: 2.1.1
Reporter: nirav patel


Hive appears to store all column and partition names internally in lowercase.
This causes problems when updating a partition via Spark. I have described the
issue in detail here:

[https://stackoverflow.com/questions/51713878/spark-hive-upsert-into-dynamic-partition-hive-table-throws-an-error-partitio/51733845#51733845]

 

 





[jira] [Created] (HIVE-20333) CBO: Join removal based on PK-FK declared constraints

2018-08-07 Thread Gopal V (JIRA)
Gopal V created HIVE-20333:
--

 Summary: CBO: Join removal based on PK-FK declared constraints
 Key: HIVE-20333
 URL: https://issues.apache.org/jira/browse/HIVE-20333
 Project: Hive
  Issue Type: Bug
Reporter: Gopal V


A query of the following shape can have its customer join removed entirely on 
the basis of the key containment between customer & store_sales.

{code}
select c_customer_sk,sum(ss_quantity*ss_sales_price) ssales from store_sales 
,customer where ss_customer_sk = c_customer_sk group by c_customer_sk;
{code}

After join removal, this query can be rewritten as:

{code}
select ss_customer_sk as c_customer_sk,sum(ss_quantity*ss_sales_price) ssales 
from store_sales where ss_customer_sk is not null group by ss_customer_sk;
{code}

The rewrite is not applied today. Note that the current PK-FK relationship does
not allow for a nullable relationship (i.e., a declared foreign key can't be
NULL).
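
As a quick sanity check of the equivalence, here is a minimal sketch that runs both query shapes against a toy in-memory SQLite database. The table contents are made up for illustration, and SQLite only stands in for Hive here; the sketch assumes {{c_customer_sk}} is unique and every non-NULL {{ss_customer_sk}} has a matching customer row.

```python
import sqlite3

# Hypothetical toy data: c_customer_sk is a primary key, and every non-NULL
# ss_customer_sk refers to an existing customer (the PK-FK containment).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customer (c_customer_sk integer primary key);
    create table store_sales (
        ss_customer_sk integer references customer (c_customer_sk),
        ss_quantity integer,
        ss_sales_price integer
    );
    insert into customer values (1), (2), (3);
    insert into store_sales values
        (1, 2, 10), (1, 1, 5), (2, 3, 7), (NULL, 4, 9);
""")

# Original shape: join store_sales with customer.
with_join = conn.execute("""
    select c_customer_sk, sum(ss_quantity*ss_sales_price) ssales
    from store_sales, customer
    where ss_customer_sk = c_customer_sk
    group by c_customer_sk
    order by c_customer_sk
""").fetchall()

# Rewritten shape: customer is gone, replaced by an IS NOT NULL filter.
without_join = conn.execute("""
    select ss_customer_sk as c_customer_sk,
           sum(ss_quantity*ss_sales_price) ssales
    from store_sales
    where ss_customer_sk is not null
    group by ss_customer_sk
    order by 1
""").fetchall()

print(with_join)
print(without_join)
```

The row with a NULL {{ss_customer_sk}} is dropped by the join in the first query and by the explicit filter in the second, which is why the filter is needed for the two shapes to agree on this data.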





[jira] [Created] (HIVE-20332) Materialized views: Introduce heuristic on selectivity over ROW__ID to favour incremental rebuild

2018-08-07 Thread Jesus Camacho Rodriguez (JIRA)
Jesus Camacho Rodriguez created HIVE-20332:
--

 Summary: Materialized views: Introduce heuristic on selectivity 
over ROW__ID to favour incremental rebuild
 Key: HIVE-20332
 URL: https://issues.apache.org/jira/browse/HIVE-20332
 Project: Hive
  Issue Type: Improvement
  Components: Materialized views
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez


Currently, we do not expose stats over {{ROW__ID.writeId}} to the optimizer.
Even if we did, we always assume a uniform distribution of the column values,
which can easily lead to overestimates of the number of rows read when we
filter on {{ROW__ID.writeId}} for materialized views (think about a large
transaction for MV creation and then small ones for incremental maintenance).
This overestimation can lead to incremental view maintenance not being
triggered, because the cost of the incremental plan is overestimated (we think
we will read more rows than we actually do). This could be fixed by introducing
histograms that better reflect the distribution of the column values.

Until then, we will use a config variable that sets the selectivity of filter
conditions on ROW__ID during the cost calculation. Setting that variable to a
low value will favour incremental rebuild over full rebuild.
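
To make the overestimation concrete, here is a small back-of-the-envelope sketch. The row counts, the simple range-based estimator, and the name and value of the selectivity setting are all assumptions for illustration; they are not taken from the patch.

```python
def uniform_selectivity(min_id, max_id, threshold):
    """Selectivity of `writeId > threshold` if values were spread uniformly."""
    if max_id == min_id:
        return 1.0
    fraction = (max_id - threshold) / (max_id - min_id)
    return min(1.0, max(0.0, fraction))

# Hypothetical source table: write ID 1 (MV creation time) holds 1,000,000
# rows, and write IDs 2..10 (later small transactions) hold 10 rows each.
rows_per_write_id = {1: 1_000_000}
rows_per_write_id.update({w: 10 for w in range(2, 11)})
total_rows = sum(rows_per_write_id.values())

# Incremental rebuild only needs the rows written after write ID 1.
last_seen_write_id = 1
actual_rows = sum(
    n for w, n in rows_per_write_id.items() if w > last_seen_write_id
)

# Uniform assumption: the filter is estimated to keep the whole table.
uniform_rows = total_rows * uniform_selectivity(1, 10, last_seen_write_id)

# Hypothetical config-driven selectivity (name and value are made up).
ROW_ID_FILTER_SELECTIVITY = 0.001
configured_rows = total_rows * ROW_ID_FILTER_SELECTIVITY

print(actual_rows, uniform_rows, configured_rows)
```

With these numbers the uniform assumption estimates that the filter reads the entire table although only 90 rows qualify, while a low fixed selectivity brings the estimate much closer to reality and therefore makes the incremental plan look cheaper.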





[jira] [Created] (HIVE-20331) Query with union all, lateral view and Join fails with "cannot find parent in the child operator"

2018-08-07 Thread Aihua Xu (JIRA)
Aihua Xu created HIVE-20331:
---

 Summary: Query with union all, lateral view and Join fails with 
"cannot find parent in the child operator"
 Key: HIVE-20331
 URL: https://issues.apache.org/jira/browse/HIVE-20331
 Project: Hive
  Issue Type: Bug
  Components: Physical Optimizer
Affects Versions: 2.1.1
Reporter: Aihua Xu
Assignee: Aihua Xu


The following query with Union, Lateral view and Join will fail during 
execution with the exception below.
{noformat}
create table t1(col1 int);
SELECT 1 AS `col1`
FROM t1
UNION ALL
  SELECT 2 AS `col1`
  FROM
    (SELECT col1
     FROM t1
    ) x1
    JOIN
      (SELECT col1
       FROM
         (SELECT
            Row_Number() over (PARTITION BY col1 ORDER BY col1) AS `col1`
          FROM t1
         ) x2 lateral VIEW explode(map(10,1))`mapObj` AS `col2`, `col3`
      ) `expdObj`
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive internal error: cannot find parent in the child operator!
    at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:362) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.ql.exec.MapOperator.initializeMapOperator(MapOperator.java:509) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:116) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
{noformat}

After debugging, it seems the issue is in the GenMRFileSink1 class, where we
set an incorrect aliasToWork on the MapWork.






[jira] [Created] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs

2018-08-07 Thread Adam Szita (JIRA)
Adam Szita created HIVE-20330:
-

 Summary: HCatLoader cannot handle multiple InputJobInfo objects 
for a job with multiple inputs
 Key: HIVE-20330
 URL: https://issues.apache.org/jira/browse/HIVE-20330
 Project: Hive
  Issue Type: Bug
  Components: HCatalog
Reporter: Adam Szita
Assignee: Adam Szita


While running performance tests on Pig (0.12 and 0.17) we've observed a huge 
performance drop in a workload that has multiple inputs from HCatLoader.

The reason is that for a particular MR job with multiple Hive tables as input,
Pig calls {{setLocation}} on each {{LoadFunc (HCatLoader)}} instance, but only
one table's information (InputJobInfo instance) gets tracked in the JobConf.
(This is under config key {{HCatConstants.HCAT_KEY_JOB_INFO}}.)

Each such call overwrites the preexisting value, so only the last table's
information is considered when Pig calls {{getStatistics}} to estimate the
required reducer count.

When there are 2 input tables, 256GB and 1MB in size respectively, Pig will
query the size information from HCat for both of them, but it will see either
1MB+1MB=2MB or 256GB+256GB=0.5TB, depending on the input order in the execution
plan's DAG. It should of course see 256.00097GB in total and accordingly use
257 reducers by default.

In the unlucky case the estimate will be 2MB, and a single reducer will have to
struggle with 256GB.
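
The effect can be reproduced with a toy model of the shared config key. This is only a sketch: the dictionary stands in for the JobConf, the key names are hypothetical, the per-table key scheme is one possible fix rather than the actual patch, and the 1 GiB of input per reducer is an assumption chosen to match the numbers above.

```python
BYTES_PER_REDUCER = 2**30  # assumption: 1 GiB of input per reducer
GB, MB = 2**30, 2**20

def reducers_for(total_bytes):
    """Ceiling division of the input size, with at least one reducer."""
    return max(1, -(-total_bytes // BYTES_PER_REDUCER))

def estimate_shared_key(table_sizes):
    """Current behaviour: every loader writes to the same config key."""
    job_conf = {}
    for size in table_sizes:              # one setLocation call per loader
        job_conf["job.info"] = size       # hypothetical key, overwritten
    # One getStatistics call per loader; each reads the surviving entry.
    return sum(job_conf["job.info"] for _ in table_sizes)

def estimate_per_table(table_sizes):
    """Possible fix: keep one entry per input table."""
    job_conf = {}
    for index, size in enumerate(table_sizes):
        job_conf[f"job.info.{index}"] = size
    return sum(job_conf[f"job.info.{i}"] for i in range(len(table_sizes)))

sizes = [256 * GB, 1 * MB]
lucky = reducers_for(estimate_shared_key([1 * MB, 256 * GB]))
unlucky = reducers_for(estimate_shared_key(sizes))
expected = reducers_for(estimate_per_table(sizes))
print(lucky, unlucky, expected)
```

Depending on which table is registered last, the shared-key model yields either 512 reducers or a single one, while tracking the tables separately gives the expected 257.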





[GitHub] hive pull request #410: HIVE-20264: Bootstrap repl dump with concurrent writ...

2018-08-07 Thread sankarh
GitHub user sankarh opened a pull request:

https://github.com/apache/hive/pull/410

HIVE-20264: Bootstrap repl dump with concurrent write and drop of ACID 
table makes target inconsistent.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sankarh/hive HIVE-20264

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/410.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #410


commit 763463f961229a69f3e714c0f5613aaeca2ddabd
Author: Sankar Hariappan 
Date:   2018-08-07T11:38:14Z

HIVE-20264: Bootstrap repl dump with concurrent write and drop of ACID 
table makes target inconsistent.




---


[jira] [Created] (HIVE-20329) Repl Scale Test : Running long running load (incr/bootstrap) causing OOM error

2018-08-07 Thread mahesh kumar behera (JIRA)
mahesh kumar behera created HIVE-20329:
--

 Summary: Repl Scale Test : Running long running load 
(incr/bootstrap) causing OOM error
 Key: HIVE-20329
 URL: https://issues.apache.org/jira/browse/HIVE-20329
 Project: Hive
  Issue Type: Task
  Components: repl
Affects Versions: 3.1.0, 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
 Fix For: 4.0.0, 3.2.0


Add tags in the JobConf for distcp-related jobs started by replication. This
will allow Hive to kill these jobs in case Beacon retries, or HS2 dies and
Beacon issues a kill command.
 * One of the tags should definitely be the query_id that starts the job. With
this flow, before retrying the bootstrap load, Beacon will issue a kill command
to HS2 with the query ID of the previously issued command. HS2 will then kill
any running jobs on YARN tagged with that query_id.

 * To get around the additional failure point mentioned above, the jobs can
also be tagged with an additional unique tag_id, provided by Beacon in the WITH
clause of the repl load command. Enhance the kill API to take the tag as input
and kill jobs associated with that tag. The problem here is how to validate the
association of the tag with a Hive query ID, to make sure this API is not used
to kill jobs run by other components. However, we can provide this capability
to admins only, which should be OK in that case.
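
The proposed flow can be sketched with a small in-memory model. Everything here is hypothetical (the class, the method names, the tag values); it only illustrates tagging jobs with both a query ID and a Beacon-provided tag, and an admin-only kill-by-tag call.

```python
class JobRegistry:
    """Toy stand-in for the YARN jobs launched on behalf of Hive."""

    def __init__(self):
        self.jobs = {}  # job_id -> {"tags": set of str, "state": str}

    def submit(self, job_id, tags):
        self.jobs[job_id] = {"tags": set(tags), "state": "RUNNING"}

    def kill_by_tag(self, tag, caller_is_admin):
        # Restricting the call to admins sidesteps validating that the tag
        # really belongs to a Hive query.
        if not caller_is_admin:
            raise PermissionError("only admins may kill jobs by tag")
        killed = []
        for job_id, job in self.jobs.items():
            if tag in job["tags"] and job["state"] == "RUNNING":
                job["state"] = "KILLED"
                killed.append(job_id)
        return killed

registry = JobRegistry()
# distcp jobs of one repl load carry the query ID and the Beacon tag.
registry.submit("distcp-1", {"query-42", "beacon-tag-7"})
registry.submit("distcp-2", {"query-42", "beacon-tag-7"})
registry.submit("unrelated-job", {"query-99"})

# Before retrying, the caller kills whatever the previous attempt left behind.
killed = registry.kill_by_tag("beacon-tag-7", caller_is_admin=True)
print(killed)
```

Killing by the Beacon-provided tag works even when the caller never learned the query ID of the previous attempt, which is the failure case the second bullet is meant to cover.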


