[ https://issues.apache.org/jira/browse/SPARK-45900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nathan Holland updated SPARK-45900:
-----------------------------------
    Description: 
I often work on projects that require deterministic randomness, especially when 
creating surrogate keys. For small volumes of data xxhash64 works well; however, 
it doesn't scale: with a 64-bit hash code, the chance of a collision already 
reaches one in a million when you hash just six million items, and it rises 
sharply from there due to the birthday paradox.
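
For concreteness, the birthday-bound approximation p ≈ 1 - exp(-n^2 / 2^(b+1)) 
behind these numbers can be checked with a few lines of Scala (a self-contained 
sketch, independent of Spark):

{code:scala}
// Approximate probability of at least one collision among n items
// hashed into b bits, via the birthday bound: p ≈ 1 - exp(-n^2 / 2^(b+1)).
def collisionProbability(n: Double, bits: Int): Double =
  1.0 - math.exp(-(n * n) / math.pow(2.0, bits + 1))

collisionProbability(6.1e6, 64) // ≈ 1e-6: one in a million at ~6M items
collisionProbability(7.7e4, 32) // ≈ 0.50: 50% near 77,000 rows (32-bit hash)
collisionProbability(5.1e9, 64) // ≈ 0.50: 50% near 5 billion rows (64-bit hash)
{code}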

Currently there are a few ways to handle this (usage is sketched after this 
list):
 - hash: 32-bit output (>50% chance of at least one collision for tables larger 
than ~77,000 rows)
 - xxhash64: 64-bit output (>50% chance of at least one collision for tables 
larger than ~5 billion rows)
 - shaXXX/md5: a single binary column as input, string output, and quite 
computationally expensive.
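
For reference, this is how those existing functions are invoked from the 
DataFrame API today (standard Spark 3.x functions; the explicit casts for 
md5/sha2 reflect their single binary input):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, md5, sha2, xxhash64}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(3).selectExpr("id AS a", "id * 2 AS b")

df.select(
  hash(col("a"), col("b")),      // 32-bit int output
  xxhash64(col("a"), col("b")),  // 64-bit long output
  // md5/sha2 take a single binary column, hence the casts:
  md5(col("a").cast("string").cast("binary")),
  sha2(col("a").cast("string").cast("binary"), 256)
).show(false)
{code}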

I'd suggest adding the newest algorithm in the xxHash family, XXH3. XXH3 is a 
modern hash function with 64-bit and 128-bit variants that provides improved 
strength and performance across the board.

I'd imagine this would be a new function named xxhash3 that supports both 
64-bit and 128-bit output lengths. For usability, I believe the bit length 
should default to 128 bits to reduce accidental collisions, leaving users to 
override it to 64 bits if they need to for additional performance or interop 
reasons (given the benchmarks, this would likely be quite rare). A possible 
shape of the API is sketched below.
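
Purely as an illustration of the proposal, not a committed design, usage might 
look like this (xxhash3 does not exist in Spark today; the name, the bit-length 
argument, and the return types are all hypothetical):

{code:scala}
// Hypothetical API sketch -- none of this exists in Spark yet.
// Default: 128-bit XXH3, e.g. returned as a 16-byte binary column.
df.select(xxhash3(col("a"), col("b")))

// Hypothetical override to the 64-bit variant, returned as a long,
// for callers who need interop with existing XXH3-64 implementations.
df.select(xxhash3(64, col("a"), col("b")))
{code}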

References:
 * [Documentation|https://xxhash.com/]
 * [xxHash64 Ticket|https://issues.apache.org/jira/browse/SPARK-27099]
 * [Existing xxHash64 
logic|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java]



> Expand hash functionalities to include XXH3
> -------------------------------------------
>
>                 Key: SPARK-45900
>                 URL: https://issues.apache.org/jira/browse/SPARK-45900
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Nathan Holland
>            Priority: Major
>


