[jira] [Created] (HIVE-20335) Add tests for materialized view rewriting with composite aggregation functions
Jesus Camacho Rodriguez created HIVE-20335:

Summary: Add tests for materialized view rewriting with composite aggregation functions
Key: HIVE-20335
URL: https://issues.apache.org/jira/browse/HIVE-20335
Project: Hive
Issue Type: Test
Components: Materialized views, Test
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Review Request 68261: HIVE-20332
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68261/
---

Review request for hive and Ashutosh Chauhan.

Bugs: HIVE-20332
    https://issues.apache.org/jira/browse/HIVE-20332

Repository: hive-git

Description
---
HIVE-20332

Diffs
---
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java e251920d8b1f14b6ca0df7855385702d7a2e2904
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveDefaultRelMetadataProvider.java 635d27e723dc1d260574723296f3484c26106a9c
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveMaterializedViewsRelMetadataProvider.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java 43f8508ffbf4ba3cc46016e1d300d6ca9c2e8ccb
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdCumulativeCost.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdDistinctRowCount.java 80b939a9f65142baa149b79460b753ddf469aacf
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSelectivity.java 575902d78de2a7f95585c23a3c2fc03b9ce89478
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdSize.java 97097381d9619e67bcab8a268d571d2a392485b3
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdUniqueKeys.java 3bf62c535cec1e7a3eac43f0ce40879dbfc89799
  ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java 361f150193a155d45eb64266f88eb88f0a881ad3
  ql/src/test/results/clientpositive/llap/materialized_view_partitioned.q.out b12df11a98e55c00c8b77e8292666373f3509364
  ql/src/test/results/clientpositive/llap/materialized_view_rebuild.q.out 4d37d82b6e1f3d4ab8b76c391fa94176356093c2

Diff: https://reviews.apache.org/r/68261/diff/1/

Testing
---

Thanks,
Jesús Camacho Rodríguez
[jira] [Created] (HIVE-20334) Don't change casing of columns and partitions or make system case agnostic
nirav patel created HIVE-20334:

Summary: Don't change casing of columns and partitions or make system case agnostic
Key: HIVE-20334
URL: https://issues.apache.org/jira/browse/HIVE-20334
Project: Hive
Issue Type: Bug
Components: Hive, HiveServer2, Metastore
Affects Versions: 2.1.1
Reporter: nirav patel

It looks like Hive internally stores all column and partition names in lowercase. This causes an issue when updating a partition via Spark. I have detailed the issue here: [https://stackoverflow.com/questions/51713878/spark-hive-upsert-into-dynamic-partition-hive-table-throws-an-error-partitio/51733845#51733845]
[jira] [Created] (HIVE-20333) CBO: Join removal based on PK-FK declared constraints
Gopal V created HIVE-20333:

Summary: CBO: Join removal based on PK-FK declared constraints
Key: HIVE-20333
URL: https://issues.apache.org/jira/browse/HIVE-20333
Project: Hive
Issue Type: Bug
Reporter: Gopal V

A query of the following shape can have its customer join removed entirely on the basis of the key containment between customer & store_sales.

{code}
select c_customer_sk, sum(ss_quantity*ss_sales_price) ssales
from store_sales, customer
where ss_customer_sk = c_customer_sk
group by c_customer_sk;
{code}

After join removal, this query can be written as

{code}
select ss_customer_sk as c_customer_sk, sum(ss_quantity*ss_sales_price) ssales
from store_sales
where ss_customer_sk is not null
group by ss_customer_sk;
{code}

The rewrite is not applied today, and the current PK-FK relationship does not allow for a nullable relationship (i.e., a declared foreign key can't be NULL).
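The equivalence of the two query shapes above can be checked on a toy dataset. The following is a minimal sketch using Python's built-in sqlite3 module rather than Hive. The sample rows are invented for illustration, and an ORDER BY is appended to each query only to make the comparison deterministic. It assumes every non-NULL ss_customer_sk references an existing customer, which is what the declared foreign key guarantees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the declared FK in SQLite
conn.executescript("""
CREATE TABLE customer (c_customer_sk INTEGER PRIMARY KEY);
CREATE TABLE store_sales (
    ss_customer_sk INTEGER REFERENCES customer (c_customer_sk),
    ss_quantity    INTEGER,
    ss_sales_price REAL
);
-- Invented sample data: customer 3 has no sales, one sale has a NULL customer.
INSERT INTO customer VALUES (1), (2), (3);
INSERT INTO store_sales VALUES
    (1, 2, 10.0), (1, 1, 5.0), (2, 4, 2.5), (NULL, 7, 1.0);
""")

QUERY_WITH_JOIN = """
select c_customer_sk, sum(ss_quantity*ss_sales_price) ssales
from store_sales, customer
where ss_customer_sk = c_customer_sk
group by c_customer_sk
order by c_customer_sk
"""

QUERY_WITHOUT_JOIN = """
select ss_customer_sk as c_customer_sk, sum(ss_quantity*ss_sales_price) ssales
from store_sales
where ss_customer_sk is not null
group by ss_customer_sk
order by ss_customer_sk
"""

with_join = conn.execute(QUERY_WITH_JOIN).fetchall()
without_join = conn.execute(QUERY_WITHOUT_JOIN).fetchall()

# Both shapes yield the same per-customer totals on this data.
print(with_join == without_join)
```

The "is not null" predicate in the rewritten query is what stands in for the inner join's filtering of sales rows that have no customer key.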
[jira] [Created] (HIVE-20332) Materialized views: Introduce heuristic on selectivity over ROW__ID to favour incremental rebuild
Jesus Camacho Rodriguez created HIVE-20332:

Summary: Materialized views: Introduce heuristic on selectivity over ROW__ID to favour incremental rebuild
Key: HIVE-20332
URL: https://issues.apache.org/jira/browse/HIVE-20332
Project: Hive
Issue Type: Improvement
Components: Materialized views
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez

Currently, we do not expose stats over {{ROW__ID.writeId}} to the optimizer. Even if we did, we always assume a uniform distribution of the column values, which can easily lead to overestimating the number of rows read when we filter on {{ROW__ID.writeId}} for materialized views (think of one large transaction for MV creation followed by small ones for incremental maintenance). This overestimation can prevent incremental view maintenance from being triggered, because the cost of the incremental plan is overestimated (we think we will read more rows than we actually do).

This could be fixed by introducing histograms that better reflect the distribution of the column values. Until then, we will use a config variable that sets the selectivity of a filter condition on ROW__ID during cost calculation. Setting that variable to a low value will favour incremental rebuild over full rebuild.
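The overestimation described above can be illustrated with some arithmetic. The following is a rough sketch and not Hive's actual estimator. The row counts, the range-based uniform formula, and the value of configured_selectivity are all hypothetical, chosen only to show the effect of one large transaction followed by small ones.

```python
# Hypothetical rows written per write id: one large initial transaction
# (write id 1), then nine small incremental ones (write ids 2..10).
rows_per_write_id = {1: 1_000_000}
rows_per_write_id.update({write_id: 100 for write_id in range(2, 11)})

total_rows = sum(rows_per_write_id.values())

# Filter of the form ROW__ID.writeId > threshold.
threshold = 5
actual_rows = sum(n for w, n in rows_per_write_id.items() if w > threshold)

# Uniform assumption: selectivity is the fraction of the value range
# that lies above the threshold, regardless of how rows are distributed.
lowest, highest = min(rows_per_write_id), max(rows_per_write_id)
uniform_selectivity = (highest - threshold) / (highest - lowest)
uniform_estimate = total_rows * uniform_selectivity

# Heuristic: a fixed, low selectivity taken from configuration
# (the value here is an arbitrary illustration).
configured_selectivity = 0.001
heuristic_estimate = total_rows * configured_selectivity

print(f"actual rows read:      {actual_rows}")
print(f"uniform estimate:      {uniform_estimate:.0f}")
print(f"heuristic estimate:    {heuristic_estimate:.0f}")
```

With these made-up numbers the uniform estimate is several orders of magnitude above the rows actually read, while the fixed low selectivity lands much closer, which is the bias the heuristic is meant to correct.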
[jira] [Created] (HIVE-20331) Query with union all, lateral view and Join fails with "cannot find parent in the child operator"
Aihua Xu created HIVE-20331:

Summary: Query with union all, lateral view and Join fails with "cannot find parent in the child operator"
Key: HIVE-20331
URL: https://issues.apache.org/jira/browse/HIVE-20331
Project: Hive
Issue Type: Bug
Components: Physical Optimizer
Affects Versions: 2.1.1
Reporter: Aihua Xu
Assignee: Aihua Xu

The following query with Union, Lateral view and Join fails during execution with the exception below.

{noformat}
create table t1(col1 int);

SELECT 1 AS `col1`
FROM t1
UNION ALL
SELECT 2 AS `col1`
FROM
  (SELECT col1 FROM t1) x1
JOIN
  (SELECT col1
   FROM
     (SELECT Row_Number() over (PARTITION BY col1 ORDER BY col1) AS `col1`
      FROM t1) x2
   lateral VIEW explode(map(10,1)) `mapObj` AS `col2`, `col3`) `expdObj`
{noformat}

{noformat}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive internal error: cannot find parent in the child operator!
  at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:362) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
  at org.apache.hadoop.hive.ql.exec.MapOperator.initializeMapOperator(MapOperator.java:509) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
  at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:116) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
{noformat}

After debugging, it seems the issue is in the GenMRFileSink1 class, where we set an incorrect aliasToWork on the MapWork.
[jira] [Created] (HIVE-20330) HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs
Adam Szita created HIVE-20330:

Summary: HCatLoader cannot handle multiple InputJobInfo objects for a job with multiple inputs
Key: HIVE-20330
URL: https://issues.apache.org/jira/browse/HIVE-20330
Project: Hive
Issue Type: Bug
Components: HCatalog
Reporter: Adam Szita
Assignee: Adam Szita

While running performance tests on Pig (0.12 and 0.17) we observed a huge performance drop in a workload that has multiple inputs from HCatLoader.

The reason is that for a particular MR job with multiple Hive tables as input, Pig calls {{setLocation}} on each {{LoaderFunc (HCatLoader)}} instance, but only one table's information (InputJobInfo instance) gets tracked in the JobConf, under the config key {{HCatConstants.HCAT_KEY_JOB_INFO}}.

Each such call overwrites the preexisting value, so only the last table's information is considered when Pig calls {{getStatistics}} to estimate the required reducer count.

For example, with 2 input tables of 256GB and 1MB respectively, Pig will query the size information from HCat for both of them, but it will see either 1MB+1MB=2MB or 256GB+256GB=0.5TB, depending on the input order in the execution plan's DAG. It should of course see 256.00097GB in total and accordingly use 257 reducers by default.

In unlucky cases the total will be 2MB, and 1 reducer will have to struggle with 256GB.
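The three outcomes described above (2MB, 0.5TB, and the correct total) can be reproduced with a small model. The following is an illustrative sketch, not HCatalog or Pig code. The function names and the table names are hypothetical, and BYTES_PER_REDUCER is set to 1GB (binary) as an assumption chosen to match the 257-reducer figure in the description.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3
BYTES_PER_REDUCER = GB  # assumption: about 1GB of input per reducer

# Hypothetical inputs matching the sizes in the description.
table_sizes = {"big_table": 256 * GB, "small_table": 1 * MB}


def estimate_reducers(total_input_bytes):
    """Reducer count for a given input size, at least one reducer."""
    return max(1, math.ceil(total_input_bytes / BYTES_PER_REDUCER))


def size_seen_with_overwrite(load_order):
    """Model of the bug: one shared slot, overwritten by each setLocation."""
    tracked = None
    for name in load_order:
        tracked = name  # the last call wins
    # Each loader is asked for its size, but all resolve to the tracked table.
    return sum(table_sizes[tracked] for _ in load_order)


def size_seen_with_per_input_tracking(load_order):
    """Model of the expected behaviour: one entry per input table."""
    return sum(table_sizes[name] for name in load_order)


for order in (["big_table", "small_table"], ["small_table", "big_table"]):
    buggy = estimate_reducers(size_seen_with_overwrite(order))
    fixed = estimate_reducers(size_seen_with_per_input_tracking(order))
    print(order, "overwrite:", buggy, "per-input:", fixed)
```

Under these assumptions the overwrite model yields 1 or 512 reducers depending on load order, while tracking each input separately yields 257 in both cases.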
[GitHub] hive pull request #410: HIVE-20264: Bootstrap repl dump with concurrent writ...
GitHub user sankarh opened a pull request:

    https://github.com/apache/hive/pull/410

    HIVE-20264: Bootstrap repl dump with concurrent write and drop of ACID table makes target inconsistent.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sankarh/hive HIVE-20264

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hive/pull/410.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #410

commit 763463f961229a69f3e714c0f5613aaeca2ddabd
Author: Sankar Hariappan
Date: 2018-08-07T11:38:14Z

    HIVE-20264: Bootstrap repl dump with concurrent write and drop of ACID table makes target inconsistent.

---
[jira] [Created] (HIVE-20329) Repl Scale Test : Running long running load (incr/bootstrap) causing OOM error
mahesh kumar behera created HIVE-20329:

Summary: Repl Scale Test : Running long running load (incr/bootstrap) causing OOM error
Key: HIVE-20329
URL: https://issues.apache.org/jira/browse/HIVE-20329
Project: Hive
Issue Type: Task
Components: repl
Affects Versions: 3.1.0, 4.0.0
Reporter: mahesh kumar behera
Assignee: mahesh kumar behera
Fix For: 4.0.0, 3.2.0

Add tags in the jobconf for distcp-related jobs started by replication. This will allow Hive to kill these jobs in case Beacon retries, or HS2 dies and Beacon issues a kill command.

* One of the tags should definitely be the query_id that starts the job. With this flow, before retrying the bootstrap load, Beacon will issue a kill command to HS2 with the query id of the previously issued command. HS2 will then kill any running jobs on YARN tagged with that query_id.
* To get around the additional failure point mentioned above, the jobs can be tagged with an additional unique tag_id, provided by Beacon in the WITH clause of the repl load command, to be used to tag distcp jobs. Enhance the kill API to take the tag as input and kill the jobs associated with that tag. The open problem is how to validate the association of the tag with a Hive query id, to make sure this API is not used to kill jobs run by other components. However, we can restrict this capability to admins, which should be acceptable in that case.