[jira] [Commented] (SPARK-17495) Hive hash implementation

Tejas Patil (JIRA) Tue, 04 Apr 2017 23:07:40 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15956353#comment-15956353
 ]


Tejas Patil commented on SPARK-17495:
-------------------------------------

[~dricard] : this is not intended to break any existing behaviour, especially 
for non-Hive workloads. For the UDF `hash()`, the output currently differs 
depending on whether it is invoked via Hive or via Spark's Hive support; this 
jira will bring the two in sync. The main motivation, however, is to support 
Hive bucketing in Spark, which relies on Hive hash.
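
To illustrate why the two hashes place rows in different buckets, here is a 
rough sketch in Python. It is not the code from either engine. It assumes 
that Hive hashes an int to the int itself and picks the bucket as 
`(hash & Integer.MAX_VALUE) % numBuckets`, and that Spark applies 32-bit 
Murmur3 with seed 42 followed by a non-negative modulo. The function names 
are hypothetical and exist only for this example.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _to_signed32(x):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit int."""
    return x - (1 << 32) if x & 0x80000000 else x


def murmur3_int(value, seed):
    """Murmur3 (x86, 32-bit) over one 4-byte integer, as a signed 32-bit int."""
    k1 = ((value & MASK32) * 0xCC9E2D51) & MASK32
    k1 = _rotl32(k1, 15)
    k1 = (k1 * 0x1B873593) & MASK32

    h1 = (seed & MASK32) ^ k1
    h1 = _rotl32(h1, 13)
    h1 = (h1 * 5 + 0xE6546B64) & MASK32

    # Finalization mix; 4 is the input length in bytes.
    h1 ^= 4
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return _to_signed32(h1)


def hive_style_bucket(value, num_buckets):
    """Assumed Hive rule: an int hashes to itself; clear the sign bit, then mod."""
    return (value & 0x7FFFFFFF) % num_buckets


def murmur3_bucket(value, num_buckets, seed=42):
    """Assumed Spark rule: Murmur3 with seed 42, then a non-negative modulo."""
    # Python's % already returns a non-negative result for a positive divisor.
    return murmur3_int(value, seed) % num_buckets


# Print both bucket assignments side by side for a few keys.
for key in range(8):
    print(key, hive_style_bucket(key, 4), murmur3_bucket(key, 4))
```

If the two columns disagree for a key, a table bucketed by one engine cannot 
be read as bucketed by the other for that key, which is the incompatibility 
a Hive-compatible hash in Spark is meant to remove.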

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

