[jira] [Commented] (SPARK-17495) Hive hash implementation

Tejas Patil (JIRA) Thu, 23 Feb 2017 23:51:07 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882161#comment-15882161
 ]


Tejas Patil commented on SPARK-17495:
-------------------------------------

I am looking into using hive-hash when `hash()` is called in a Hive context. 
Before jumping to a PR, I wanted to discuss which model we should use.

Currently, calling `hash()` in SQL uses murmur3. For anyone porting from Hive 
to Spark, this will give different results than Hive does. 
- One easy thing to do is to replace the `hash` impl in `FunctionRegistry` 
for a Hive-enabled context. Downside: some users create a Hive-enabled 
context but still operate over Spark native tables. Using hive-hash is 
not something they want.
- It's hard to detect whether a given query result will be written to a Hive 
or a Spark native table. E.g. one could cache / persist and later choose to 
write the output to both a Hive table and a Spark native table. We could push 
this decision to users by adding a config to use hive-hash. Note that this 
needs to be a static config that can only be set when the session is created. 
Letting users flip the config in the middle of a session is risky, as it can 
lead to undesired outputs.
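
To make the config-based option concrete, here is a minimal Python sketch 
rather than Spark code. All names in it (`SessionConf`, `Session`, 
`use_hive_hash`) are hypothetical. `hive_style_hash` uses the Java-style 
`31 * h + byte` scheme, which I am assuming matches Hive for strings, and 
`native_standin_hash` is CRC32 used purely as a stand-in for the engine's 
default hash (it is NOT murmur3). The point is only that the choice is frozen 
when the session is created.

```python
import zlib
from dataclasses import dataclass


def _to_int32(x: int) -> int:
    """Wrap an integer to a signed 32-bit value, as Java int arithmetic would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def hive_style_hash(s: str) -> int:
    """Java-style 31 * h + byte over UTF-8 bytes (assumed Hive-like for strings)."""
    h = 0
    for b in s.encode("utf-8"):
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = _to_int32(h * 31 + signed)
    return h


def native_standin_hash(s: str) -> int:
    """Stand-in for the engine's default hash. CRC32 here, NOT murmur3."""
    return _to_int32(zlib.crc32(s.encode("utf-8")))


@dataclass(frozen=True)
class SessionConf:
    """Hypothetical static config: frozen, so it cannot change after creation."""
    use_hive_hash: bool = False


class Session:
    """Hypothetical session that picks its hash implementation exactly once."""

    def __init__(self, conf: SessionConf):
        self._conf = conf
        self._hash_fn = hive_style_hash if conf.use_hive_hash else native_standin_hash

    @property
    def conf(self) -> SessionConf:
        return self._conf

    def hash(self, s: str) -> int:
        return self._hash_fn(s)

    def bucket(self, s: str, num_buckets: int) -> int:
        # Clear the sign bit before taking the modulus so the id is non-negative.
        return (self.hash(s) & 0x7FFFFFFF) % num_buckets


hive_session = Session(SessionConf(use_hive_hash=True))
native_session = Session(SessionConf())
print(hive_session.bucket("ab", 8), native_session.bucket("ab", 8))
```

Because `SessionConf` is frozen, an attempt such as 
`hive_session.conf.use_hive_hash = False` raises an error instead of silently 
changing how later rows are hashed, which is the mid-session flip we want to 
rule out.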

I am open to comments about these two options. Unless there are any 
objections, I will move forward with the 2nd approach of using a config.


> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



