[ https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip (flip) Kromer updated DATAFU-47:
---------------------------------------
    Attachment: 0001-UDF-for-Murmur3-and-other-Hash-functions.patch

Sorry, I had to prep the patch.

You can also supply the string 'rand' as a seed to the seedable hashes. This gives you a different random sequence on every run without depending heavily on the random number generator.

> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>
>                 Key: DATAFU-47
>                 URL: https://issues.apache.org/jira/browse/DATAFU-47
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>              Labels: Guava, Hash, UDF
>         Attachments: 0001-UDF-for-Murmur3-and-other-Hash-functions.patch
>
>
> Datafu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (and also handles md5 and sha*). If you're willing to upgrade Guava, we'd get sip24 (a fast cryptographically secure hash), crc32 (in case you want that) and adler32 (another checksum-style hash).
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. Murmur3 is fast and has exceptionally good statistical properties; it's a good choice if all you need is good mixing of the inputs. It is _not_ cryptographically secure; that is, given an output value from murmur3, there are efficient algorithms to find an input yielding the same output value. Supply the seed as a string that [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)] can handle.
> * 'good-{integer number of bits}': Returns a general-purpose, non-cryptographic-strength, streaming hash function that produces hash codes of length at least minimumBits. Users without specific compatibility requirements and who do not persist the hash codes are encouraged to choose this hash function.
>   (Cryptographers, like dieticians and fashionistas, occasionally realize that We've Been Doing it Wrong This Whole Time. Using 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To (Wear|Eat|Hash With) this Fall. Therefore, expect that values returned with this choice may change from version to version.)
> * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 MessageDigest.
> * 'sha1': Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 MessageDigest.
> * 'sha256': Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 MessageDigest.
> * 'sha512': Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 MessageDigest.
> Examples:
> {code}
> define DefaultH    datafu.pig.hash.Hasher();
> define GoodH       datafu.pig.hash.Hasher('good-32');
> define BetterH     datafu.pig.hash.Hasher('good-127');
> define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
> define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
> define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
> define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
> define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
> define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
> define MD5H        datafu.pig.hash.Hasher('md5');
> define SHA1H       datafu.pig.hash.Hasher('sha1');
> define SHA256H     datafu.pig.hash.Hasher('sha256');
> define SHA512H     datafu.pig.hash.Hasher('sha512');
>
> data_in = LOAD 'input' as (val:chararray);
>
> data_out = FOREACH data_in GENERATE
>   DefaultH(val), GoodH(val), BetterH(val),
>   MurmurH32(val), MurmurH32A(val), MurmurH32B(val),
>   MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
>   SHA1H(val), SHA256H(val), SHA512H(val),
>   MD5H(val)
>   ;
> STORE data_out INTO 'output';
> {code}
> In practice:
> {code}
> -- Consistent shuffle of large dataset with only one full-table reduce step.
> -- Every pig run with the same seed will generate sorted output in the same order
> define MurmurH32 datafu.pig.hash.Hasher('murmur3-32');
> -- Force each file to go in whole to a single mapper (or in the LOAD use -tagSplit, to be added in future Pig version)
> SET mapred.max.split.size 1099511627776;
> -- -tagPath option labels each file
> data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS (path:chararray, val:chararray);
> data_numbered = RANK data_in;
> data_ided = FOREACH data_numbered GENERATE
>   MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS shuffle_key,
>   val AS val;
> data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
> STORE data_shuffled INTO 'data_shuffled';
> {code}
> Important notes about this patch:
> * It should be applied _after_ the patch for DATAFU-46
> * It expands the dependence on Guava. Does [pull req 75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to de-Guava datafu?
> * The patch has (commented out) code that shows what life would be like if the sip24, crc32 and adler32 hashes were available. On your advice, I will either (a) put in a patch removing the spurious comments or (b) file a separate bug to update guava, push in a patch for that, and put in a patch restoring to glory the extra hashes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
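
The digest sizes in the javadoc list and the seed strings used in the examples can be sanity-checked against the plain JDK. The following is a minimal sketch, assuming only java.security.MessageDigest and Integer.decode; the class HashFacts and its two helpers are hypothetical names for illustration and are not part of the patch.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration; not part of the DATAFU-47 patch.
class HashFacts {

    /** Returns the digest length, in bits, of a JDK MessageDigest algorithm. */
    static int digestBits(String algorithm) {
        try {
            // Hash an empty input and measure the output size in bytes.
            return MessageDigest.getInstance(algorithm).digest(new byte[0]).length * 8;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("JDK lacks " + algorithm, e);
        }
    }

    /** True if Integer.decode accepts the given seed string. */
    static boolean isDecodableSeed(String seed) {
        try {
            Integer.decode(seed);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String alg : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(alg + ": " + digestBits(alg) + " bits");
        }
        // Seed strings taken from the Pig examples in the issue description.
        for (String seed : new String[] {"0x0", "0x56789abc", "-12345678"}) {
            System.out.println(seed + " -> " + Integer.decode(seed));
        }
    }
}
```

Note that Integer.decode on its own rejects '0xffffffff', since that value is outside the int range; a negative seed has to be written with a leading minus sign, as in '-12345678'.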