[ https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eyal Allweil updated DATAFU-47:
-------------------------------
    Attachment: DATAFU-47-new.patch

I looked at the review board for this issue, fixed the merge conflicts in HashTests, and addressed the comments that were left. It depends on DATAFU-50, which was reopened, but I put a new patch there so that we can proceed with both.

Since I didn't create the review, I can't upload a new diff there, but I've attached it to the Jira issue, and commented in the review board where appropriate.

Tests pass, and I've run the content of "hasherTest" on a cluster using the assembled DataFu jar to make sure that the autojarring of the new Guava version works properly.

I'll respond to the review board comments later.

> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>
>                 Key: DATAFU-47
>                 URL: https://issues.apache.org/jira/browse/DATAFU-47
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>              Labels: Guava, Hash, UDF
>         Attachments: 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch, 0001-UDF-for-Murmur3-and-other-Hash-functions.patch, DATAFU-47-new.patch
>
>
> Datafu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (a fast hash with good statistical properties), SipHash-2-4 (a fast cryptographically secure hash), crc32, adler32, md5 and sha.
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. Murmur3 is fast and has exceptionally good statistical properties; it's a good choice if all you need is good mixing of the inputs. It is _not_ cryptographically secure; that is, given an output value from murmur3, there are efficient algorithms to find an input yielding the same output value.
> Supply the seed as a string that [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)] can handle.
> * 'sip24', [optional seed]: Returns a [64-bit SipHash-2-4|https://131002.net/siphash/]. SipHash is competitive in performance with Murmur3, and is simpler and faster than the cryptographic algorithms below. When used with a seed, it can be considered cryptographically secure: given the output from a sip24 instance but not the seed used, we cannot efficiently craft a message yielding the same output from that instance.
> * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to Java's Adler32 Checksum.
> * 'crc32': Returns a CRC-32 checksum (32 hash bits) by delegating to Java's CRC32 Checksum.
> * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 MessageDigest.
> * 'sha1': Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 MessageDigest.
> * 'sha256': Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 MessageDigest.
> * 'sha512': Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 MessageDigest.
> * 'good-(integer number of bits)': Returns a general-purpose, non-cryptographic-strength, streaming hash function that produces hash codes of length at least minimumBits. Users without specific compatibility requirements and who do not persist the hash codes are encouraged to choose this hash function. (Cryptographers, like dieticians and fashionistas, occasionally realize that We've Been Doing it Wrong This Whole Time. Using 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To (Wear|Eat|Hash With) this Fall.) Values for this hash will change from run to run.
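The seed format and the digest widths listed above can be checked against the JDK directly. What follows is a minimal sketch and not code from the patch: the class `HashSketch` and its method names are hypothetical, and it uses only the standard-library classes the javadoc says the UDF delegates to (`Integer.decode`, `java.util.zip.CRC32`, `java.util.zip.Adler32`, `java.security.MessageDigest`). The Guava-backed murmur3, sip24 and good-* variants are deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical helper for illustration only; it is not part of the DATAFU-47 patch.
final class HashSketch {

    // Parse a seed string the way the javadoc describes: anything Integer.decode accepts
    // (decimal, 0x-prefixed hex, optional leading sign).
    static int parseSeed(String seed) {
        return Integer.decode(seed);
    }

    // Output width, in bits, of a JDK MessageDigest algorithm such as "MD5" or "SHA-512".
    static int digestBits(String jdkAlgorithm) {
        try {
            return MessageDigest.getInstance(jdkAlgorithm).getDigestLength() * 8;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest: " + jdkAlgorithm, e);
        }
    }

    // Feed the UTF-8 bytes of the input to a JDK Checksum and return its 32-bit value as a long.
    static long checksum(Checksum c, String input) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    public static void main(String[] args) {
        // Seeds taken from the Pig examples in this issue.
        for (String seed : new String[] {"0x0", "0x56789abc", "-12345678"}) {
            System.out.println(seed + " -> " + parseSeed(seed));
        }
        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(algorithm + ": " + digestBits(algorithm) + " bits");
        }
        System.out.println("crc32:   " + Long.toHexString(checksum(new CRC32(), "123456789")));
        System.out.println("adler32: " + Long.toHexString(checksum(new Adler32(), "Wikipedia")));
    }
}
```

The sketch encodes input as UTF-8, which is an assumption of the example; the patch may convert chararray values to bytes differently.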
> Examples:
> {code}
>   define DefaultH    datafu.pig.hash.Hasher();
>   define GoodH       datafu.pig.hash.Hasher('good-32');
>   define BetterH     datafu.pig.hash.Hasher('good-127');
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
>   define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
>   define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
>   define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
>   define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
>   define MD5H        datafu.pig.hash.Hasher('md5');
>   define SHA1H       datafu.pig.hash.Hasher('sha1');
>   define SHA256H     datafu.pig.hash.Hasher('sha256');
>   define SHA512H     datafu.pig.hash.Hasher('sha512');
>
>   data_in = LOAD 'input' as (val:chararray);
>
>   data_out = FOREACH data_in GENERATE
>     DefaultH(val), GoodH(val), BetterH(val),
>     MurmurH32(val), MurmurH32A(val), MurmurH32B(val),
>     MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
>     SHA1H(val), SHA256H(val), SHA512H(val),
>     MD5H(val)
>     ;
>   STORE data_out INTO 'output';
> {code}
> In practice:
> {code}
>   -- Consistent shuffle of large dataset with only one full-table reduce step.
>   -- Every pig run with the same seed will generate sorted output in the same order
>   define MurmurH32 datafu.pig.hash.Hasher('murmur3-32');
>   -- Force each file to go in whole to a single mapper (or in the LOAD use -tagSplit, to be added in future Pig version)
>   SET mapred.max.split.size 1099511627776;
>   -- -tagPath option labels each file
>   data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS (path:chararray, val:chararray);
>   data_numbered = RANK data_in;
>   data_ided = FOREACH data_numbered GENERATE
>     MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS shuffle_key,
>     val AS val;
>   data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
>   STORE data_shuffled INTO 'data_shuffled';
> {code}
> Important notes about this patch:
> * It should be applied _after_ the patches for DATAFU-46 and DATAFU-48.
> * -(It expands the dependence on Guava. Does [pull req 75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to de-Guava datafu?)-
> * -(The patch has (commented out) code that shows what life would be like if the sip24, crc32 and adler32 hashes were available. On your advice, I will either (a) put in a patch removing the spurious comments or (b) file a separate bug to update guava, push in a patch for that, and put in a patch restoring to glory the extra hashes.)-

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)