[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882161#comment-15882161 ]
Tejas Patil commented on SPARK-17495:
-------------------------------------

I am looking into using hive-hash when `hash()` is called in a Hive-enabled context. Before jumping to a PR, I wanted to discuss which model we should use. Currently, `hash()` in SQL uses Murmur3, so anyone porting queries from Hive to Spark will see different results.

- One easy option is to replace the `hash` implementation in `FunctionRegistry` for Hive-enabled contexts. Downside: some users create a Hive-enabled context but still operate only over Spark-native tables; using hive-hash is not what they want.
- It is hard to detect whether a given query result will be written to a Hive table or a Spark-native table. For example, one could cache / persist a result and later choose to write it to both a Hive table and a Spark-native table. We could push this decision to users by adding a config to enable hive-hash. Note that this needs to be a static config, settable only when the session is created: letting users flip it in the middle of a session is risky, as it can lead to undesired outputs.

I am open to comments on these two options. Unless there are objections, I will move forward with the second approach of using a config.

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>      Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>
> Spark internally uses Murmur3Hash for partitioning. This is different from
> the one used by Hive. For queries that use bucketing, this leads to different
> results if one runs the same query on both engines. We want users to have
> backward compatibility so that one can switch parts of an application across
> the engines without observing regressions.
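To make the mismatch concrete, here is a minimal, hypothetical sketch (not Spark's or Hive's actual code) of a Hive-style hash. It assumes Hive's hash of an int is the value itself and its hash of a string is a Java-style 31-based polynomial over the UTF-8 bytes; Spark's `hash()` instead uses Murmur3 with a fixed seed of 42, so the two functions produce entirely different values for the same input, which is why bucketed data written by one engine lands in the wrong buckets for the other.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: assumed Hive-style per-type hashing rules,
// shown to contrast with Spark's Murmur3-based `hash()`.
public class HiveHashSketch {

    // Assumed Hive-style hash of an int: the value itself.
    static int hiveHashInt(int value) {
        return value;
    }

    // Assumed Hive-style hash of a string: h = h * 31 + byte over UTF-8 bytes.
    static int hiveHashString(String s) {
        int h = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h = h * 31 + b;
        }
        return h;
    }

    public static void main(String[] args) {
        // For ASCII input this polynomial matches String.hashCode():
        System.out.println(hiveHashInt(42));        // 42
        System.out.println(hiveHashString("abc"));  // 96354
        // Spark's hash(42) or hash("abc") (Murmur3, seed 42) would give
        // unrelated values, hence the bucketing incompatibility.
    }
}
```

Under this sketch, a row bucketed by `hiveHashString(key) % numBuckets` in Hive would generally be assigned a different bucket by Spark's Murmur3-based `hash()`, which is the regression the hive-hash option is meant to avoid.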
-- This message was sent by Atlassian JIRA (v6.3.15#6346)