[
https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eyal Allweil updated DATAFU-47:
-------------------------------
Attachment: DATAFU-47-new.patch
I went through the review board for this issue, fixed the merge conflicts in
HashTests, and addressed the comments that were left. This patch depends on
DATAFU-50, which was reopened; I've put a new patch there so that we can
proceed with both.
Since I didn't create the review, I can't upload a new diff there, but I've
attached it to the Jira issue, and commented in the review board where
appropriate.
Tests pass, and I've run the content of "hasherTest" on a cluster using the
assembled DataFu jar to make sure that the autojarring of the new Guava version
works properly.
I'll respond to the review board comments later.
> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>
> Key: DATAFU-47
> URL: https://issues.apache.org/jira/browse/DATAFU-47
> Project: DataFu
> Issue Type: Improvement
> Reporter: Philip (flip) Kromer
> Labels: Guava, Hash, UDF
> Attachments:
> 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch,
> 0001-UDF-for-Murmur3-and-other-Hash-functions.patch, DATAFU-47-new.patch
>
>
> Datafu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (a fast hash with good
> statistical properties), SipHash-2-4 (a fast cryptographically secure hash),
> crc32, adler32, md5 and sha.
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a
> [murmur3 hash|https://code.google.com/p/smhasher/] of the given length.
> Murmur3 is fast and has exceptionally good statistical properties; it's a
> good choice if all you need is good mixing of the inputs. It is _not_
> cryptographically secure; that is, given an output value from murmur3, there
> are efficient algorithms to find an input yielding the same output value.
> Supply the seed as a string that
> [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
> can handle.
> * 'sip24', [optional seed]: Returns a [64-bit
> SipHash-2-4|https://131002.net/siphash/]. SipHash is competitive in
> performance with Murmur3, and is simpler and faster than the cryptographic
> algorithms below. When used with a seed, it can be considered
> cryptographically secure: given the output from a sip24 instance but not the
> seed used, we cannot efficiently craft a message yielding the same output
> from that instance.
> * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to
> Java's Adler32 Checksum.
> * 'crc32': Returns a CRC-32 checksum (32 hash bits) by delegating to Java's
> CRC32 Checksum.
> * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5
> MessageDigest.
> * 'sha1': Returns a SHA-1 hash (160 hash bits) using Java's SHA-1
> MessageDigest.
> * 'sha256': Returns a SHA-256 hash (256 hash bits) using Java's SHA-256
> MessageDigest.
> * 'sha512': Returns a SHA-512 hash (512 hash bits) using Java's SHA-512
> MessageDigest.
> * 'good-(integer number of bits)': Returns a general-purpose,
> non-cryptographic-strength, streaming hash function that produces hash codes
> of length at least minimumBits. Users without specific compatibility
> requirements and who do not persist the hash codes are encouraged to choose
> this hash function. (Cryptographers, like dieticians and fashionistas,
> occasionally realize that We've Been Doing it Wrong This Whole Time. Using
> 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To
> (Wear|Eat|Hash With) this Fall.) Values for this hash will change from run to
> run.
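To make the bit widths and the seed format in the list above concrete, here is a minimal sketch that uses only the JDK classes the javadoc says the UDF delegates to (MessageDigest, CRC32, Adler32) plus Integer.decode. It is not the patch's code: the class name HashSketch and its helper methods are hypothetical, and it leaves out murmur3, sip24 and good-*, which come from Guava.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

// Hypothetical helper class for illustration; not part of DataFu.
class HashSketch {
    /** Output width in bits of a JDK MessageDigest such as "MD5" or "SHA-512". */
    static int digestBits(String algorithm) {
        try {
            // Digest of the empty input; the array length is the hash size in bytes.
            return MessageDigest.getInstance(algorithm).digest().length * 8;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** CRC-32 checksum of the UTF-8 bytes of s, as an unsigned 32-bit value in a long. */
    static long crc32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Adler-32 checksum of the UTF-8 bytes of s, as an unsigned 32-bit value in a long. */
    static long adler32(String s) {
        Adler32 adler = new Adler32();
        adler.update(s.getBytes(StandardCharsets.UTF_8));
        return adler.getValue();
    }

    /** Parses a seed string the way the javadoc describes: anything Integer.decode accepts. */
    static int decodeSeed(String seed) {
        return Integer.decode(seed);
    }

    public static void main(String[] args) {
        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(algorithm + ": " + digestBits(algorithm) + " bits");
        }
        System.out.println("crc32:   " + Long.toHexString(crc32("123456789")));
        System.out.println("adler32: " + Long.toHexString(adler32("Wikipedia")));
        for (String seed : new String[] {"0x0", "0x56789abc", "-12345678"}) {
            System.out.println("seed " + seed + " -> " + decodeSeed(seed));
        }
    }
}
```

One thing to keep in mind with this seed format: Integer.decode parses hex strings as signed values, so '0xffffffff' is out of range and throws NumberFormatException. A seed with the high bit set has to be written as a negative number.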
> Examples:
> {code}
> define DefaultH datafu.pig.hash.Hasher();
> define GoodH datafu.pig.hash.Hasher('good-32');
> define BetterH datafu.pig.hash.Hasher('good-127');
> define MurmurH32 datafu.pig.hash.Hasher('murmur3-32');
> define MurmurH32A datafu.pig.hash.Hasher('murmur3-32', '0x0');
> define MurmurH32B datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
> define MurmurH128 datafu.pig.hash.Hasher('murmur3-128');
> define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
> define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
> define MD5H datafu.pig.hash.Hasher('md5');
> define SHA1H datafu.pig.hash.Hasher('sha1');
> define SHA256H datafu.pig.hash.Hasher('sha256');
> define SHA512H datafu.pig.hash.Hasher('sha512');
>
> data_in = LOAD 'input' as (val:chararray);
>
> data_out = FOREACH data_in GENERATE
> DefaultH(val), GoodH(val), BetterH(val),
> MurmurH32(val), MurmurH32A(val), MurmurH32B(val),
> MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
> SHA1H(val), SHA256H(val), SHA512H(val),
> MD5H(val)
> ;
> STORE data_out INTO 'output';
> {code}
> In practice:
> {code}
> -- Consistent shuffle of large dataset with only one full-table reduce step.
> -- Every pig run with the same seed will generate sorted output in the same order
> define MurmurH32 datafu.pig.hash.Hasher('murmur3-32');
> -- Force each file to go in whole to a single mapper (or in the LOAD use
> -- -tagSplit, to be added in future Pig version)
> SET mapred.max.split.size 1099511627776;
> -- -tagPath option labels each file
> data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS
> (path:chararray, val:chararray);
> data_numbered = RANK data_in;
> data_ided = FOREACH data_numbered GENERATE
> MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS
> shuffle_key,
> val AS val;
> data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
> STORE data_shuffled INTO 'data_shuffled';
> {code}
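The idea behind the script above, ordering records by a hash of (path, rank) so that the same input always comes out in the same pseudo-random order, can be sketched outside Pig as well. This is only an illustration under stated assumptions: the class ShuffleSketch and its methods are hypothetical, it handles a single file path, and it swaps in the JDK's CRC32 for murmur3-32 because murmur3 is not available without Guava.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration of the hash-keyed shuffle; not part of DataFu.
class ShuffleSketch {
    /** Shuffle key for one record: a hash of "path#rank" (CRC32 stands in for murmur3-32). */
    static long key(String path, long rank) {
        CRC32 crc = new CRC32();
        crc.update((path + "#" + rank).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /**
     * Returns the values reordered by their shuffle key. The rank is the
     * 1-based position of the value in the input list, playing the role of
     * the row number that RANK assigns in the Pig script.
     */
    static List<String> shuffle(String path, List<String> values) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            order.add(i);
        }
        // Sort positions by hash key; List.sort is stable, so equal keys keep input order.
        order.sort(Comparator.comparingLong(i -> key(path, i + 1)));

        List<String> shuffled = new ArrayList<>();
        for (int i : order) {
            shuffled.add(values.get(i));
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> values = List.of("a", "b", "c", "d", "e");
        // Same path and same input order give the same output order on every run.
        System.out.println(shuffle("part-00000", values));
    }
}
```

Because the key depends only on the path and the row position, rerunning over unchanged input reproduces the order, which is the property the Pig comments describe. It also shows why each file has to go whole to a single mapper: the row position is only stable if the file is always read in the same order.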
> Important notes about this patch:
> * It should be applied _after_ the patches for DATAFU-46 and DATAFU-48.
> * -(It expands the dependence on Guava. Does [pull req
> 75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to
> de-Guava datafu?)-
> * -(The patch has (commented out) code that shows what life would be like if
> the sip24, crc32 and adler32 hashes were available. On your advice, I will
> either (a) put in a patch removing the spurious comments or (b) file a
> separate bug to update guava, push in a patch for that, and put in a patch
> restoring to glory the extra hashes.)-