Philip (flip) Kromer created DATAFU-47:
------------------------------------------
Summary: UDF for Murmur3 (and other) Hash functions
Key: DATAFU-47
URL: https://issues.apache.org/jira/browse/DATAFU-47
Project: DataFu
Issue Type: Improvement
Reporter: Philip (flip) Kromer
Datafu should offer the murmur3 hash.
The attached patch uses Guava to add murmur3 (and also handles md5 and sha*).
If you're willing to upgrade Guava, we'd get sip24 (a fast cryptographically
secure hash), crc32 (in case you want that) and adler32 (another checksum-style
hash).
From the javadoc:
* 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a
[murmur3 hash|https://code.google.com/p/smhasher/] of the given length. Murmur3
is fast and has exceptionally good statistical properties; it's a good choice
if all you need is good mixing of the inputs. It is _not_ cryptographically
secure; that is, given an output value from murmur3, there are efficient
algorithms to find an input yielding the same output value. Supply the seed as
a string that
[Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
can handle.
* 'good-{integer number of bits}': Returns a general-purpose,
non-cryptographic-strength, streaming hash function that produces hash codes of
length at least the requested number of bits. Users without specific
compatibility requirements and who do not persist the hash codes are
encouraged to choose this hash
function. (Cryptographers, like dieticians and fashionistas, occasionally
realize that We've Been Doing it Wrong This Whole Time. Using 'good-*' lets you
track What the Experts From (Milan|NIH|IEEE) Say To (Wear|Eat|Hash With) this
Fall. Therefore, expect that values returned with this choice may change from
version to version.)
* 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 MessageDigest.
* 'sha1': Returns a SHA-1 hash (160 hash bits) using Java's SHA-1
MessageDigest.
* 'sha256': Returns a SHA-256 hash (256 hash bits) using Java's SHA-256
MessageDigest.
* 'sha512': Returns a SHA-512 hash (512 hash bits) using Java's SHA-512
MessageDigest.
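The digest sizes for the MessageDigest-backed options can be checked directly against the JDK. The following is a minimal sketch and is not code from the patch; the class name DigestBits and the sample input are made up for illustration. It uses only the standard java.security.MessageDigest API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the DATAFU-47 patch.
public class DigestBits {
    // Returns the digest length in bits for a JDK MessageDigest algorithm name.
    static int bits(String algorithm) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        // The input is arbitrary; the digest length does not depend on it.
        byte[] digest = md.digest("datafu".getBytes(StandardCharsets.UTF_8));
        return digest.length * 8;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        for (String name : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(name + ": " + bits(name) + " bits");
        }
    }
}
```

On a standard JDK this reports 128, 160, 256 and 512 bits respectively.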
Examples:
{code}
datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
datafu.pig.hash.Hasher('murmur3-32', '-12345678');
{code}
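To see what seed values those two strings stand for, here is a small sketch of how Integer.decode reads them. This only illustrates the JDK call named in the javadoc; the class name SeedDecode is hypothetical, and how the patch passes the resulting int on to Guava is not shown here.

```java
// Hypothetical demo class, not part of the DATAFU-47 patch.
public class SeedDecode {
    // Parses a seed string the way Integer.decode does: decimal, hex (0x, 0X, #)
    // or octal (leading 0), with an optional sign.
    static int seed(String s) {
        return Integer.decode(s);
    }

    public static void main(String[] args) {
        System.out.println(seed("0x56789abc")); // 1450744508
        System.out.println(seed("-12345678"));  // -12345678
        System.out.println(seed("012"));        // 10, because a leading 0 means octal

        try {
            // Hex values above 0x7fffffff do not fit in a signed int.
            seed("0xffffffff");
        } catch (NumberFormatException e) {
            System.out.println("0xffffffff is out of range for Integer.decode");
        }
    }
}
```

Two things are worth keeping in mind when supplying seeds: a leading zero switches the parse to octal, and a hex seed with the top bit set has to be written as a negative number.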
Important notes about this patch:
* It should be applied _after_ the patch for DATAFU-46
* It expands the dependence on Guava. Does [pull req
75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to
de-Guava datafu?
* The patch has (commented out) code that shows what life would be like if the
sip24, crc32 and adler32 hashes were available. On your advice, I will either
(a) put in a patch removing the spurious comments or (b) file a separate bug to
update guava, push in a patch for that, and put in a patch restoring to glory
the extra hashes.
--
This message was sent by Atlassian JIRA
(v6.2#6252)