[ 
https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001262#comment-14001262
 ] 

Philip (flip) Kromer commented on DATAFU-47:
--------------------------------------------

Patch should be applied on top of DATAFU-46

> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>
>                 Key: DATAFU-47
>                 URL: https://issues.apache.org/jira/browse/DATAFU-47
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>              Labels: Guava, Hash, UDF
>
> DataFu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (and also handles md5 and sha*). 
> If you're willing to upgrade Guava, we'd get sip24 (a fast cryptographically 
> secure hash), crc32 (in case you want that) and adler32 (another 
> checksum-style hash). 
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
> [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. 
> Murmur3 is fast and has exceptionally good statistical properties; it's a 
> good choice if all you need is good mixing of the inputs. It is _not_ 
> cryptographically secure; that is, given an output value from murmur3, there 
> are efficient algorithms to find an input yielding the same output value. 
> Supply the seed as a string that 
> [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
>  can handle.
> * 'good-{integer number of bits}': Returns a general-purpose, 
> non-cryptographic-strength, streaming hash function that produces hash codes 
> of length at least minimumBits. Users without specific compatibility 
> requirements and who do not persist the hash codes are encouraged to choose 
> this hash function. (Cryptographers, like dieticians and fashionistas, 
> occasionally realize that We've Been Doing it Wrong This Whole Time. Using 
> 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To 
> (Wear|Eat|Hash With) this Fall. Therefore, expect that values returned with 
> this choice may change from version to version.)
> * 'md5':     Returns an MD5 hash (128 hash bits) using Java's MD5 
> MessageDigest.
> * 'sha1':    Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
> MessageDigest.
> * 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
> MessageDigest.
> * 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
> MessageDigest.
> Examples: 
> {code}
> datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
> datafu.pig.hash.Hasher('murmur3-32', '-12345678');
> {code}
> Important notes about this patch:
> * It should be applied _after_ the patch for DATAFU-46
> * It expands the dependence on Guava. Does [pull req 
> 75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to 
> de-Guava datafu?
> * The patch has (commented out) code that shows what life would be like if 
> the sip24, crc32 and adler32 hashes were available. On your advice, I will 
> either (a) put in a patch removing the spurious comments or (b) file a 
> separate bug to update guava, push in a patch for that, and put in a patch 
> restoring to glory the extra hashes.
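
A minimal sketch of two details from the javadoc quoted above, using only the
JDK: how Integer.decode reads the two example seed strings, and how many hash
bits each MessageDigest algorithm produces. The class name HasherSketch and its
methods are hypothetical and are not part of the patch; the murmur3 call itself
goes through Guava and is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not code from the DATAFU-47 patch.
final class HasherSketch {

    // Parse a seed string the way the javadoc describes: anything that
    // Integer.decode accepts (decimal, 0x-prefixed hex, optional sign).
    static int parseSeed(String seed) {
        return Integer.decode(seed);
    }

    // Hash a fixed sample input with the named JDK algorithm and report the
    // digest size in bits (digest byte count times 8).
    static int digestBits(String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest("datafu".getBytes(StandardCharsets.UTF_8));
            return digest.length * 8;
        } catch (NoSuchAlgorithmException e) {
            // Assumption: the running JDK ships this algorithm.
            throw new IllegalStateException("Algorithm not available: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        // The two seeds from the Examples block of the issue description.
        System.out.println(parseSeed("0x56789abc"));
        System.out.println(parseSeed("-12345678"));

        // Algorithm names as the JDK spells them.
        for (String name : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(name + ": " + digestBits(name) + " hash bits");
        }
    }
}
```

A seed that does not fit in a signed 32-bit int (for example '0xffffffff') makes
Integer.decode throw NumberFormatException, so a large hex seed would have to be
written in its negative form.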



--
This message was sent by Atlassian JIRA
(v6.2#6252)
