[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388915#comment-16388915
 ] 

Rui Li commented on HIVE-15104:
-------------------------------

[~stakiar], thanks for trying this out.
bq. The HiveKryoRegistrator still seems to be serializing the hashCode so where 
are the actual savings coming from?
I didn't look deeply into Kryo, but I think the reason is that the generic Kryo 
SerDe has some overhead for storing class meta info, while 
 in {{HiveKryoRegistrator}} we store only the data. My earlier 
[comment|https://issues.apache.org/jira/browse/HIVE-15104?focusedCommentId=16007788&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16007788]
 shows that a custom SerDe can bring improvements for BytesWritable too.
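The class-metadata overhead can be illustrated with a small stand-alone sketch. This is not Hive code: {{FakeKey}} is a hypothetical stand-in for HiveKey, and plain Java serialization stands in for a generic Kryo SerDe. The point is only that a generic SerDe writes type information per record, while a hand-rolled one writes just the payload:

```java
import java.io.*;

// Hypothetical stand-in for HiveKey: a byte array plus a cached hash code.
class FakeKey implements Serializable {
    final byte[] bytes;
    final int hash;
    FakeKey(byte[] bytes, int hash) { this.bytes = bytes; this.hash = hash; }
}

public class SerDeOverhead {
    // Generic serialization: class metadata is written along with the data.
    static int genericSize(FakeKey k) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(k);
        }
        return bos.size();
    }

    // Custom serialization: only the payload and the hash code are written.
    static int customSize(FakeKey k) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeInt(k.bytes.length);  // 4 bytes: payload length
            dos.write(k.bytes);            // the payload itself
            dos.writeInt(k.hash);          // 4 bytes: hash code
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        FakeKey k = new FakeKey(new byte[]{1, 2, 3, 4}, 42);
        System.out.println("generic=" + genericSize(k) + " custom=" + customSize(k));
    }
}
```

The custom form costs 12 bytes for this key (length + 4 payload bytes + hash), while the generic form additionally carries per-record type information.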

bq. I'm not sure I understand why the performance should improve when 
hive.spark.use.groupby.shuffle is set to false.
I guess the difference is due to the different shuffles used -- when 
{{hive.spark.use.groupby.shuffle}} is false, the group-by-key shuffle is replaced 
with a repartition-and-sort-within-partition shuffle. And yes, the registrator is 
the same in both cases.
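For reference, the switch in question is a session-level Hive setting, so the sort-based shuffle can be selected per session:

```sql
-- Replace the group-by-key shuffle with repartition-and-sort-within-partition
SET hive.spark.use.groupby.shuffle=false;
```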

bq. why do we need the hashCode after deserializing the data?
For MR, the hash code is not needed for a deserialized HiveKey (see 
HiveKey::hashCode), because by the time a HiveKey is deserialized, it has already 
been distributed to the proper reducer. For Spark, RDDs may get cached during 
execution, so if we deserialize a cached RDD and try to partition it for a 
downstream reducer, we need the hash code to still be available after 
deserialization.
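The idea can be sketched as follows. {{CachedHashKey}} is a hypothetical illustration, not the actual HiveKey: the hash code is written alongside the payload and restored on read, so a key deserialized from a cached RDD partitions exactly as it did before the cache round-trip.

```java
import java.io.*;

// Hypothetical sketch: a key that survives (de)serialization with its
// distribution hash intact, so it can still be routed to a downstream
// reducer after being read back from a cached RDD.
class CachedHashKey {
    byte[] bytes;
    int hash;  // partition hash, set by the writer

    void write(DataOutput out) throws IOException {
        out.writeInt(bytes.length);
        out.write(bytes);
        out.writeInt(hash);  // kept so partitioning works after deserialization
    }

    static CachedHashKey read(DataInput in) throws IOException {
        CachedHashKey k = new CachedHashKey();
        k.bytes = new byte[in.readInt()];
        in.readFully(k.bytes);
        k.hash = in.readInt();  // restore the hash instead of recomputing it
        return k;
    }

    // Same partition computation before and after a serialization round-trip.
    int partition(int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }
}
```

In the MR path, by contrast, the hash can simply be dropped on deserialization because no further partitioning happens once the key reaches its reducer.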

> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
>                 Key: HIVE-15104
>                 URL: https://issues.apache.org/jira/browse/HIVE-15104
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.2.1
>            Reporter: wangwenli
>            Assignee: Rui Li
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HIVE-15104.1.patch, HIVE-15104.10.patch, 
> HIVE-15104.2.patch, HIVE-15104.3.patch, HIVE-15104.4.patch, 
> HIVE-15104.5.patch, HIVE-15104.6.patch, HIVE-15104.7.patch, 
> HIVE-15104.8.patch, HIVE-15104.9.patch, TPC-H 100G.xlsx
>
>
> The same SQL, running on the Spark and MR engines, will generate different 
> sizes of shuffle data.
> I think it is because Hive on MR serializes only part of the HiveKey, while 
> Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
