[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=447075&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447075
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 17/Jun/20 05:25
Start Date: 17/Jun/20 05:25
Worklog Time Spent: 10m 
  Work Description: kasakrisz closed pull request #1132:
URL: https://github.com/apache/hive/pull/1132


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 447075)
Time Spent: 2h  (was: 1h 50m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
>  ) year_total
>,'w' sale_type
>  from customer
>  ,web_sales
>  ,date_dim
>  where c_customer_sk = ws_bill_customer_sk
>and ws_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  )
>   select  
>   t_s_secyear.customer_id
>  ,t_s_secyear.customer_first_name
>  ,t_s_secyear.customer_last_name
>  ,t_s_secyear.customer_birth_country
>  from year_total t_s_firstyear
>  ,year_total t_s_secyear
>  ,year_total t_c_firstyear
> 

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=447071&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447071
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 17/Jun/20 05:22
Start Date: 17/Jun/20 05:22
Worklog Time Spent: 10m 
  Work Description: kasakrisz closed pull request #1096:
URL: https://github.com/apache/hive/pull/1096


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 447071)
Time Spent: 1h 50m  (was: 1h 40m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
>  ) year_total
>,'w' sale_type
>  from customer
>  ,web_sales
>  ,date_dim
>  where c_customer_sk = ws_bill_customer_sk
>and ws_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  )
>   select  
>   t_s_secyear.customer_id
>  ,t_s_secyear.customer_first_name
>  ,t_s_secyear.customer_last_name
>  ,t_s_secyear.customer_birth_country
>  from year_total t_s_firstyear
>  ,year_total t_s_secyear
>  ,year_total t_c_firs

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=447066&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447066
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 17/Jun/20 05:08
Start Date: 17/Jun/20 05:08
Worklog Time Spent: 10m 
  Work Description: kasakrisz opened a new pull request #1132:
URL: https://github.com/apache/hive/pull/1132


   fix commit mesage



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 447066)
Time Spent: 1h 40m  (was: 1.5h)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
>  ) year_total
>,'w' sale_type
>  from customer
>  ,web_sales
>  ,date_dim
>  where c_customer_sk = ws_bill_customer_sk
>and ws_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  )
>   select  
>   t_s_secyear.customer_id
>  ,t_s_secyear.customer_first_name
>  ,t_s_secyear.customer_last_name
>  ,t_s_secyear.customer_birth_country
>  from year_total t_s_firstyear
>  ,year_total t_s_secyear
> 

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=447061&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447061
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 17/Jun/20 05:00
Start Date: 17/Jun/20 05:00
Worklog Time Spent: 10m 
  Work Description: kasakrisz merged pull request #1124:
URL: https://github.com/apache/hive/pull/1124


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 447061)
Time Spent: 1.5h  (was: 1h 20m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
>  ) year_total
>,'w' sale_type
>  from customer
>  ,web_sales
>  ,date_dim
>  where c_customer_sk = ws_bill_customer_sk
>and ws_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  )
>   select  
>   t_s_secyear.customer_id
>  ,t_s_secyear.customer_first_name
>  ,t_s_secyear.customer_last_name
>  ,t_s_secyear.customer_birth_country
>  from year_total t_s_firstyear
>  ,year_total t_s_secyear
>  ,year_total t_c_firstyea

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=446581&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-446581
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 16/Jun/20 15:20
Start Date: 16/Jun/20 15:20
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #1124:
URL: https://github.com/apache/hive/pull/1124#discussion_r440936180



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveCardinalityPreservingJoinRule.java
##
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptCost;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.metadata.JaninoRelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.HiveDefaultTezModelRelMetadataProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Rule to trigger {@link HiveCardinalityPreservingJoinOptimization} on top of 
the plan.
+ */
+public class HiveCardinalityPreservingJoinRule extends HiveFieldTrimmerRule {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveCardinalityPreservingJoinRule.class);
+
+  private final double factor;
+
+  public HiveCardinalityPreservingJoinRule(double factor) {
+super(false, "HiveCardinalityPreservingJoinRule");
+this.factor = Math.max(factor, 0.0);

Review comment:
   `HiveCost` doesn't allow negative numbers:
   
https://github.com/apache/hive/blob/e74029d4fd5c4bfc50d33a8f1155ffacc151ba8f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/cost/HiveCost.java#L78
   
   But there is a check in `CalcitePlanner`: the rule is not added if the 
factor is 0 or negative
   ```
   if (factor > 0.0) {
   generatePartialProgram(program, false, HepMatchOrder.TOP_DOWN,
   new HiveCardinalityPreservingJoinRule(factor));
 }
   ``` 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 446581)
Time Spent: 1h 20m  (was: 1h 10m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>   

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=446577&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-446577
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 16/Jun/20 15:13
Start Date: 16/Jun/20 15:13
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1124:
URL: https://github.com/apache/hive/pull/1124#discussion_r440929783



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveCardinalityPreservingJoinRule.java
##
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptCost;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.metadata.JaninoRelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.HiveDefaultTezModelRelMetadataProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Rule to trigger {@link HiveCardinalityPreservingJoinOptimization} on top of 
the plan.
+ */
+public class HiveCardinalityPreservingJoinRule extends HiveFieldTrimmerRule {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveCardinalityPreservingJoinRule.class);
+
+  private final double factor;
+
+  public HiveCardinalityPreservingJoinRule(double factor) {
+super(false, "HiveCardinalityPreservingJoinRule");
+this.factor = Math.max(factor, 0.0);
+  }
+
+  @Override
+  protected RelNode trim(RelOptRuleCall call, RelNode node) {
+RelNode optimized = new 
HiveCardinalityPreservingJoinOptimization().trim(call.builder(), node);
+if (optimized == node) {
+  return node;
+}
+
+JaninoRelMetadataProvider original = 
RelMetadataQuery.THREAD_PROVIDERS.get();

Review comment:
   Can we move this to the top of the method and put the 
`RelMetadataQuery.THREAD_PROVIDERS.set(original);` in a finally block? If there 
is an exception while getting the stats (for instance, if stats are not 
available), we should make sure that we are setting it back to the original one.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 446577)
Time Spent: 1h 10m  (was: 1h)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cus

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=446575&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-446575
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 16/Jun/20 15:10
Start Date: 16/Jun/20 15:10
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1124:
URL: https://github.com/apache/hive/pull/1124#discussion_r440927072



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveCardinalityPreservingJoinRule.java
##
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptCost;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.metadata.JaninoRelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.HiveDefaultTezModelRelMetadataProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Rule to trigger {@link HiveCardinalityPreservingJoinOptimization} on top of 
the plan.
+ */
+public class HiveCardinalityPreservingJoinRule extends HiveFieldTrimmerRule {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveCardinalityPreservingJoinRule.class);
+
+  private final double factor;
+
+  public HiveCardinalityPreservingJoinRule(double factor) {
+super(false, "HiveCardinalityPreservingJoinRule");
+this.factor = Math.max(factor, 0.0);
+  }
+
+  @Override
+  protected RelNode trim(RelOptRuleCall call, RelNode node) {
+RelNode optimized = new 
HiveCardinalityPreservingJoinOptimization().trim(call.builder(), node);
+if (optimized == node) {
+  return node;
+}
+
+JaninoRelMetadataProvider original = 
RelMetadataQuery.THREAD_PROVIDERS.get();
+RelMetadataQuery.THREAD_PROVIDERS.set(getJaninoRelMetadataProvider());
+RelMetadataQuery metadataQuery = RelMetadataQuery.instance();
+
+RelOptCost optimizedCost = metadataQuery.getCumulativeCost(optimized);
+RelOptCost originalCost = metadataQuery.getCumulativeCost(node);
+originalCost = originalCost.multiplyBy(factor);
+LOG.debug("Original plan cost {} Optimized plan cost {}", originalCost, 
optimizedCost);

Review comment:
   nit. `Original plan cost: {} vs Optimized plan cost: {}` ? Or something 
like that so we can read it more clearly.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 446575)
Time Spent: 1h  (was: 50m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is br

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=446573&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-446573
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 16/Jun/20 15:08
Start Date: 16/Jun/20 15:08
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1124:
URL: https://github.com/apache/hive/pull/1124#discussion_r440925660



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveCardinalityPreservingJoinRule.java
##
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptCost;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.metadata.JaninoRelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import 
org.apache.hadoop.hive.ql.optimizer.calcite.HiveDefaultTezModelRelMetadataProvider;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Rule to trigger {@link HiveCardinalityPreservingJoinOptimization} on top of 
the plan.
+ */
+public class HiveCardinalityPreservingJoinRule extends HiveFieldTrimmerRule {
+  private static final Logger LOG = 
LoggerFactory.getLogger(HiveCardinalityPreservingJoinRule.class);
+
+  private final double factor;
+
+  public HiveCardinalityPreservingJoinRule(double factor) {
+super(false, "HiveCardinalityPreservingJoinRule");
+this.factor = Math.max(factor, 0.0);

Review comment:
   Maybe you should allow negative numbers (e.g., -1) to disable the 
optimization completely.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 446573)
Time Spent: 50m  (was: 40m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=446513&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-446513
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 16/Jun/20 13:35
Start Date: 16/Jun/20 13:35
Worklog Time Spent: 10m 
  Work Description: kasakrisz opened a new pull request #1124:
URL: https://github.com/apache/hive/pull/1124


   Testing done:
   ```
   mvn test -Dtest.output.overwrite -DskipSparkTests 
-Dtest=TestTezPerfConstraintsCliDriver 
-Dqfile=cbo_query4.q,cbo_query11.q,cbo_query74.q,query4.q,query11.q,query74.q 
-pl itests/qtest -Pitests
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 446513)
Time Spent: 40m  (was: 0.5h)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
>  ) year_total
>,'w' sale_type
>  from customer
>  ,web_sales
>  ,date_dim
>  where c_customer_sk = ws_bill_customer_sk
>and ws_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  )
>   select  
>   t_s_secyear.customer_id
>  ,t_s_s

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=445639&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445639
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 15/Jun/20 04:17
Start Date: 15/Jun/20 04:17
Worklog Time Spent: 10m 
  Work Description: kasakrisz commented on a change in pull request #1096:
URL: https://github.com/apache/hive/pull/1096#discussion_r439922505



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##
@@ -2389,6 +2390,11 @@ private RelNode applyPostJoinOrderingTransform(RelNode 
basePlan, RelMetadataProv
 
   final HepProgramBuilder program = new HepProgramBuilder();
 
+  if 
(conf.getBoolVar(ConfVars.HIVE_CARDINALITY_PRESERVING_JOIN_OPTIMIZATION)) {
+generatePartialProgram(program, false, HepMatchOrder.TOP_DOWN,
+new HiveCardinalityPreservingJoinRule());

Review comment:
   `HiveCardinalityPreservingJoinRule` is not stateless: it extends 
`HiveFieldTrimmerRule` which tracks whether it was triggered or not to avoid 
unnecessary multiple traversals.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 445639)
Time Spent: 0.5h  (was: 20m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+w

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=445622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-445622
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 15/Jun/20 01:18
Start Date: 15/Jun/20 01:18
Worklog Time Spent: 10m 
  Work Description: jcamachor commented on a change in pull request #1096:
URL: https://github.com/apache/hive/pull/1096#discussion_r439890211



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
##
@@ -2389,6 +2390,11 @@ private RelNode applyPostJoinOrderingTransform(RelNode 
basePlan, RelMetadataProv
 
   final HepProgramBuilder program = new HepProgramBuilder();
 
+  if 
(conf.getBoolVar(ConfVars.HIVE_CARDINALITY_PRESERVING_JOIN_OPTIMIZATION)) {
+generatePartialProgram(program, false, HepMatchOrder.TOP_DOWN,
+new HiveCardinalityPreservingJoinRule());

Review comment:
   nit. Create a static final instance and use it (as other rules do 
similarly).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 445622)
Time Spent: 20m  (was: 10m)

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
>  ) year_total
>,'w' sale_type
>  from customer
>  ,web_sales
>  

[jira] [Work logged] (HIVE-23493) Rewrite plan to join back tables with many projected columns joined multiple times

2020-06-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23493?focusedWorklogId=444733&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-444733
 ]

ASF GitHub Bot logged work on HIVE-23493:
-

Author: ASF GitHub Bot
Created on: 12/Jun/20 06:26
Start Date: 12/Jun/20 06:26
Worklog Time Spent: 10m 
  Work Description: kasakrisz opened a new pull request #1096:
URL: https://github.com/apache/hive/pull/1096


   Testing done:
   ```
   mvn test -Dtest.output.overwrite -DskipSparkTests 
-Dtest=TestMiniLlapLocalCliDriver -Dqfile=cardinality_preserving_join_opt.q -pl 
itests/qtest -Pitests
   mvn test -Dtest.output.overwrite -DskipSparkTests 
-Dtest=TestTezPerfConstraintsCliDriver 
-Dqfile=cbo_query4.q,cbo_query11.q,cbo_query74.q,query4.q,query11.q,query74.q 
-pl itests/qtest -Pitests
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 444733)
Remaining Estimate: 0h
Time Spent: 10m

> Rewrite plan to join back tables with many projected columns joined multiple 
> times
> --
>
> Key: HIVE-23493
> URL: https://issues.apache.org/jira/browse/HIVE-23493
> Project: Hive
>  Issue Type: New Feature
>  Components: CBO
>Reporter: Krisztian Kasa
>Assignee: Krisztian Kasa
>Priority: Major
> Attachments: HIVE-23493.1.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Queries with a pattern where one or more tables joins with a fact table in a 
> CTE. Many columns are projected out those tables and then grouped in the CTE. 
>  The main query joins multiple instances of the CTE and may project a subset 
> of these.
> The optimization is to rewrite the CTE to include only key (PK, non null 
> Unique Key) columns and join the tables back to the resultset of the main 
> query to fetch the rest of the wide columns. This reduces the datasize of the 
> joined back tables that is broadcast/shuffled throughout the DAG processing.
> Example query, tpc-ds query4
> {code}
> with year_total as (
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
>  year_total
>,'s' sale_type
>  from customer
>  ,store_sales
>  ,date_dim
>  where c_customer_sk = ss_customer_sk
>and ss_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
>  union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumcs_ext_list_price-cs_ext_wholesale_cost-cs_ext_discount_amt)+cs_ext_sales_price)/2)
>  ) year_total
>,'c' sale_type
>  from customer
>  ,catalog_sales
>  ,date_dim
>  where c_customer_sk = cs_bill_customer_sk
>and cs_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>  ,c_email_address
>  ,d_year
> union all
>  select c_customer_id customer_id
>,c_first_name customer_first_name
>,c_last_name customer_last_name
>,c_preferred_cust_flag customer_preferred_cust_flag
>,c_birth_country customer_birth_country
>,c_login customer_login
>,c_email_address customer_email_address
>,d_year dyear
>
> ,sumws_ext_list_price-ws_ext_wholesale_cost-ws_ext_discount_amt)+ws_ext_sales_price)/2)
>  ) year_total
>,'w' sale_type
>  from customer
>  ,web_sales
>  ,date_dim
>  where c_customer_sk = ws_bill_customer_sk
>and ws_sold_date_sk = d_date_sk
>  group by c_customer_id
>  ,c_first_name
>  ,c_last_name
>  ,c_preferred_cust_flag
>  ,c_birth_country
>  ,c_login
>