[jira] [Commented] (HIVE-20032) Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled

Sahil Takiar (JIRA) Thu, 12 Jul 2018 19:00:14 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-20032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542413#comment-16542413
 ]


Sahil Takiar commented on HIVE-20032:
-------------------------------------

As for benchmarking, I have run a lot of TPC-DS benchmarks, and I don't
consistently get better performance. However, the amount of shuffled data is
significantly reduced (as is the amount of data spilled to disk). My guess
is that latency doesn't improve much because I'm running my tests on an
unloaded cluster. However, I expect cluster throughput to be better with this
patch since fewer I/O resources are used. I'll need to run some concurrent
TPC-DS workloads to confirm this, though.

> Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled
> -------------------------------------------------------------------------
>
>                 Key: HIVE-20032
>                 URL: https://issues.apache.org/jira/browse/HIVE-20032
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-20032.1.patch, HIVE-20032.2.patch, 
> HIVE-20032.3.patch, HIVE-20032.4.patch
>
>
> Follow-up to HIVE-15104: if we don't enable RDD caching or groupByShuffles,
> then we don't need to serialize the hashCode when shuffling data in HoS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
