[jira] [Comment Edited] (HIVE-16923) Hive-on-Spark DPP Improvements

Sahil Takiar (JIRA) Tue, 20 Jun 2017 15:20:18 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056570#comment-16056570
 ]


Sahil Takiar edited comment on HIVE-16923 at 6/20/17 10:19 PM:
---------------------------------------------------------------

Will post a design doc soon.

Two of the biggest limitations of the current DPP implementation are that it 
requires an additional Spark job and that it requires writing some intermediate 
data to HDFS. We should evaluate the overhead of these limitations and whether 
it's possible to remove them.

Ideally, DPP shouldn't hurt performance for any query. One way to ensure this 
is to build some type of cost-based model that predicts whether DPP will help 
performance. For example, a simple cost-based model could enable DPP for 
map-joins only. Since map-joins already require two Spark jobs and writing 
intermediate data to HDFS, there shouldn't be significant overhead to running 
DPP with a map-join.
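
A minimal sketch of the map-join-only rule, to make the idea concrete. All 
names here (DppCostModelSketch, JoinType, shouldEnableDpp) are hypothetical and 
are not existing Hive classes or APIs; a real cost model would presumably also 
weigh table and partition statistics.

```java
public class DppCostModelSketch {

    /** Hypothetical join categories; not Hive's actual operator types. */
    public enum JoinType {
        MAP_JOIN,     // small side is broadcast; already needs an extra Spark job
        COMMON_JOIN   // shuffle join; DPP would add a job and an HDFS write
    }

    /**
     * Simplest possible "cost model": enable DPP only when the join is a
     * map-join, since the extra Spark job and the intermediate HDFS write
     * are already being paid for in that case.
     */
    public static boolean shouldEnableDpp(JoinType joinType) {
        return joinType == JoinType.MAP_JOIN;
    }

    public static void main(String[] args) {
        // Print the decision for each hypothetical join type.
        for (JoinType joinType : JoinType.values()) {
            System.out.println(joinType + " -> enable DPP: " + shouldEnableDpp(joinType));
        }
    }
}
```

The rule is deliberately conservative: it never enables DPP where it would 
add a new job, at the cost of missing shuffle-join queries where pruning 
would still have been a net win.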


was (Author: stakiar):
Will post a design doc soon.

Two of the biggest limitations of the current DPP implementation are that it 
requires an additional Spark job and it requires writing some intermediate data 
to HDFS.

Ideally, DPP shouldn't hurt performance for any query. One way to ensure this 
is to build some type of cost-based model that predicts whether DPP will help 
performance. For example, a simple cost-based model could enable DPP for 
map-joins only. Since map-joins already require two Spark jobs and writing 
intermediate data to HDFS, there shouldn't be significant overhead to running 
DPP with a map-join.

> Hive-on-Spark DPP Improvements
> ------------------------------
>
>                 Key: HIVE-16923
>                 URL: https://issues.apache.org/jira/browse/HIVE-16923
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>
> Improvements to Hive-on-Spark DPP so that it is production ready.
> Hive-on-Spark DPP was implemented in HIVE-9152. However, it is disabled by 
> default. The goal of this JIRA is to improve the DPP implementation so that 
> it can be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
