Rather than calling it hash64, it'd be better to just call it xxhash64. The 
reason being ten years from now, we probably would look back and laugh at a 
specific hash implementation. It'd be better to just name the expression what 
it is.

On Wed, Mar 06, 2019 at 7:59 PM, < huon.wil...@data61.csiro.au > wrote:

> 
> 
> 
> Hi,
> 
> 
> 
> I’m working on something that requires deterministic randomness, i.e. a
> row gets the same “random” value no matter the order of the DataFrame. A
> seeded hash seems to be the perfect way to do this, but the existing
> hashes have various limitations:
> 
> 
> 
> - hash: 32-bit output (only 4 billion possibilities will result in a lot
> of collisions for many tables: the birthday paradox implies >50% chance of
> at least one for tables larger than 77000 rows, and likely ~1.6 billion
> collisions in a table of size 4 billion)
> - sha1/sha2/md5: single binary column input, string output
> 
> 
> 
> It seems there’s already support for a 64-bit hash function that can work
> with an arbitrary number of arbitrary-typed columns (XxHash64), and
> exposing this for DataFrames seems like it’s essentially one line in
> sql/functions.scala to match `hash` (plus docs, tests, function registry
> etc.):
> 
> 
> 
> def hash64(cols: Column*): Column = withExpr { new
> XxHash64(cols.map(_.expr)) }
> 
> 
> 
> For my use case, this can then be used to get a 64-bit “random” column
> like
> 
> 
> 
> val seed = rng.nextLong()
> hash64(lit(seed), col1, col2)
> 
> 
> 
> I’ve created a (hopefully) complete patch by mimicking ‘hash’ at https:/ /
> github. com/ apache/ spark/ compare/ master... huonw:hash64 (
> https://github.com/apache/spark/compare/master...huonw:hash64 ) ; should I
> open a JIRA and submit it as a pull request?
> 
> 
> 
> Additionally, both hash and the new hash64 already have support for being
> seeded, but this isn’t exposed directly and instead requires something
> like the `lit` above. Would it make sense to add overloads like the
> following?
> 
> 
> 
> def hash(seed: Int, cols: Columns*) = …
> def hash64(seed: Long, cols: Columns*) = …
> 
> 
> 
> Though, it does seem a bit unfortunate to be forced to pass the seed
> first.
> 
> 
> 
> (I sent this email to user@ spark. apache. org ( u...@spark.apache.org ) a
> few days ago, but didn't get any discussion about the Spark aspects of
> this, so I'm resending it here; I apologise in advance if I'm breaking
> protocol!)
> 
> 
> 
> - Huon Wilson
> 
> 
> 
> --------------------------------------------------------------------- To
> unsubscribe e-mail: dev-unsubscribe@ spark. apache. org (
> dev-unsubscr...@spark.apache.org )
> 
> 
>

Reply via email to