[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889645#comment-15889645
 ] 

Tejas Patil commented on SPARK-17495:
-------------------------------------

>> Is it possible to figure out the hashing function based on file names? 

The way datasource files are named differs from Hive, so this would work. I 
was thinking of a simpler approach: use hive-hash only when writing to Hive 
bucketed tables. Since Spark doesn't support Hive bucketing at the moment, any 
existing data must have been generated by Hive, so this will not cause 
breakages for users.

>> 3. In general it'd be useful to allow users to configure which actual hash 
>> function "hash" maps to. This can be a dynamic config.

For any operation on Hive bucketed tables, we should not let users change the 
hashing function; we should do the right thing underneath. Otherwise, users 
can shoot themselves in the foot (e.g. joining two Hive tables that are both 
bucketed but use different hashing functions). One option was to store the 
hashing function used to populate a table in the metastore, but that would not 
be compatible with Hive and would mess things up in environments where Spark 
and Hive are used together.
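The join hazard above can be sketched with a toy example. Hive hashes an int 
column value to the value itself and assigns buckets via 
`(hash & Integer.MAX_VALUE) % numBuckets`; Spark's Murmur3Hash mixes the bytes 
instead. The Murmur3-style mixer below is a simplified stand-in (the fmix32 
finalizer), not Spark's exact implementation, so the exact bucket ids are 
illustrative only:

```java
public class BucketDivergence {
    // Hive's hash of an int value is the value itself.
    static int hiveHash(int v) {
        return v;
    }

    // Simplified stand-in for Spark's Murmur3Hash: the Murmur3 fmix32
    // finalizer (NOT Spark's exact implementation).
    static int murmurStyleHash(int v) {
        int h = v;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Non-negative modulo, matching Hive's bucket assignment formula.
    static int bucket(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 8;
        // The same key lands in different buckets under the two hash
        // functions, so a bucketed join between a Hive-written table and a
        // Spark-written table would silently match the wrong files.
        for (int key = 0; key < 8; key++) {
            System.out.println("key=" + key
                + " hiveBucket=" + bucket(hiveHash(key), numBuckets)
                + " murmurBucket=" + bucket(murmurStyleHash(key), numBuckets));
        }
    }
}
```

With identical hash functions on both sides, bucket ids line up and a 
bucketed join can match files pairwise; with differing functions they do not, 
which is why the hashing function must be fixed per table, not per session.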

As far as the plain `hash()` UDF / function is concerned, I am a bit 
conservative about adding a dynamic config, as I feel it might cause problems. 
Say you start off a session with the default murmur3 hash, compute some data, 
and cache it. If the user later switches to hive hash, reusing the cached data 
as-is would no longer be correct. Keeping the config static for a session 
would avoid such problems.
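The caching hazard can be made concrete with a small sketch. The two hash 
functions below are hypothetical stand-ins (not murmur3 or hive-hash), and the 
"cache" is just a map of precomputed partition ids:

```java
import java.util.HashMap;
import java.util.Map;

public class CachedHashHazard {
    // Hypothetical stand-ins for two hash functions a session could
    // switch between (for illustration only).
    static int hashA(int v) {
        int h = v * 0x9E3779B9; // Fibonacci-style multiplicative mix
        return h ^ (h >>> 16);
    }

    static int hashB(int v) {
        return v; // identity, like Hive's int hash
    }

    static int bucket(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 8;
        // Partition ids cached while the session used hashA.
        Map<Integer, Integer> cached = new HashMap<>();
        for (int key = 0; key < 4; key++) {
            cached.put(key, bucket(hashA(key), numBuckets));
        }
        // After switching the session to hashB, recomputed partition ids
        // disagree with the cached ones, so reusing the cache is wrong.
        for (int key = 0; key < 4; key++) {
            int recomputed = bucket(hashB(key), numBuckets);
            System.out.println("key=" + key + " cached=" + cached.get(key)
                + " recomputed=" + recomputed);
        }
    }
}
```

A session-static setting sidesteps this entirely: everything computed in the 
session, cached or not, uses one hash function.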

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing, this leads to 
> different results if one runs the same query on both engines. We want users 
> to have backward compatibility, so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
