[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883394#comment-15883394 ]

Reynold Xin commented on SPARK-17495:
-------------------------------------

Let me put some thoughts here... Please let me know if I missed anything:

1. On the read side we shouldn't care which hash function was used. All we need 
to know is that the data is hash partitioned by some deterministic hash 
function, and that should be sufficient to remove the shuffle needed for 
aggregation or join.
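
A minimal sketch of that read-side point (the table name "events" is 
hypothetical; assumes it was written bucketed by user_id). Aggregation only 
needs equal keys to be co-located in the same partition, which any 
deterministic hash partitioning guarantees, so the planner can drop the 
Exchange no matter which hash function wrote the buckets:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Equal user_id values already live in the same bucket file, so the
    // aggregate can run without repartitioning the data first.
    val agg = spark.table("events").groupBy("user_id").count()
    agg.explain()  // expect no Exchange before the final aggregate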

2. On the write side it does matter. If we are writing to a Hive bucketed 
table, the Hive hash function should be used; otherwise a Spark hash function 
should be used. This can perhaps be an option in the writer interface, and 
automatically populated for catalog tables based on what kind of table it is.
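
As a hedged sketch of what such a writer option could look like (the option 
key "bucketing.hashFunction" is made up here, not an existing Spark API; "df" 
is any DataFrame):

    // Hypothetical: pick the hash function per table kind. For a Hive
    // bucketed table the catalog would populate this with "hive"
    // automatically; Spark-native tables would default to "murmur3".
    df.write
      .bucketBy(8, "user_id")
      .sortBy("user_id")
      .option("bucketing.hashFunction", "hive")  // hypothetical option key
      .saveAsTable("events_bucketed")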

3. In general it'd be useful to allow users to configure which actual hash 
function the "hash" expression maps to. This can be a dynamic config.
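
A sketch of the dynamic-config idea (the config key below is hypothetical, 
not an actual Spark setting):

    // Hypothetical session-level switch for which implementation the
    // `hash` expression resolves to; being dynamic, it could differ
    // per session or per query.
    spark.conf.set("spark.sql.hash.implementation", "hive")  // hypothetical key
    spark.sql("SELECT hash('spark')").show()  // would now use Hive's hash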





> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the hash function used by Hive. For queries that use bucketing, this leads to 
> different results if one runs the same query on both engines. We want users 
> to have backward compatibility so that one can switch parts of an application 
> across the engines without observing regressions.
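
For illustration, a small sketch of that divergence (assumes a local Spark 
session; the Hive behavior is described in a comment, not asserted as output):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.hash

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Spark's hash() expression is Murmur3-based.
    Seq("spark").toDF("s").select(hash($"s")).show()
    // Hive's hash('spark') follows Java-hashCode-style rules instead, so
    // the same key lands in different buckets on the two engines even
    // when the bucket count is identical.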


